- Overview
- Postorder processing
- Key IR concepts
- SemIR textual format
- Core loop
- Types
- Expression categories
- Handling Parse::Tree errors (not yet implemented)
- Alternatives considered
Check takes the parse tree and generates a semantic intermediate representation, or SemIR. This will look closer to a series of instructions, in preparation for transformation to LLVM IR. Semantic analysis and type checking occurs during the production of SemIR. It also does any validation that requires context.
The checking step is oriented on postorder processing on the Parse::Tree
to
iterate through the Parse::NodeImpl
vectorized storage once, in order, as much
as possible. This is primarily for performance, but also relies on the
information accumulation principle:
that is, when that principle applies, we should be able to generate IR
immediately because we can rely on the principle that when a line is processed,
the information necessary to semantically check that line is already available.
Indirectly, what this really means is that we should be able to go from a Parse::Tree (which cannot be used for name lookups) to a SemIR with name lookups completed in a single pass. The SemIR should not need to be re-processed to add more information outside of templates. By doing this, we avoid an additional processing pass with associated storage needs.
This single-pass approach also means that the checking step does not make use of
the tree structure of the Parse::Tree
. In cases where the actions performed
for a parse tree node depend on the context in which that node appears, a node
that is visited earlier in the postorder traversal, such as a bracketing node,
needs to establish the necessary context. In this respect, the sequence of
Parse::Node
s can be thought of as a byte code input that the check step
interprets to build the SemIR
.
A SemIR::Inst
is the basic building block that represents a simple
instruction, such as an operator or declaring a literal. For each kind of
instruction, a typedef for that specific kind of instruction is provided in the
SemIR
namespace. For example, SemIR::Assign
represents an assignment
instruction, and SemIR::PointerType
represents a pointer type instruction.
Each instruction class has up to four public data members describing the instruction, as described in sem_ir/typed_insts.h (also see adding features for Check):
-
A
Parse::Node parse_node;
member that tracks its location is present on almost all instructions, except instructions likeSemIR::Builtin
that don't have an associated location. -
A
SemIR::TypeId type_id;
member that describes the type of the instruction is present on all instructions that produce a value. This includes namespace instructions, which are modeled as producing a value of "namespace" type, even though they can't be used as a first-class value in Carbon expressions. -
Up to two additional, kind-specific members. For example
SemIR::Assign
has membersInstId lhs_id
andInstId rhs_id
.
Instructions are stored as type-erased SemIR::Inst
objects, which store the
instruction kind and the (up to) four fields described above. This balances the
size of SemIR::Inst
against the overhead of indirection.
A SemIR::InstBlock
can represent a code block. However, it can also be created
when a series of instructions needs to be closely associated, such as a
parameter list.
A SemIR::Builtin
represents a language built-in, such as the unconstrained
facet type type
. We will also have built-in functions which would need to form
the implementation of some library types, such as i32
. Built-ins are in a
stable index across SemIR
instances.
Parameters and arguments will be stored as two SemIR::InstBlock
s each. The
first will contain the full IR, while the second will contain references to the
last instruction for each parameter or argument. The references block will have
a size equal to the number of parameters or arguments, allowing for quick size
comparisons and indexed access.
There are two textual ways to view SemIR
.
The raw form of SemIR shows the details of the representation, such as numeric
instruction and block IDs. The representation is intended to very closely match
the SemIR::File
and SemIR::Inst
representations. This can be useful when
debugging low-level issues with the SemIR
representation.
The driver will print this when passed --dump-raw-sem-ir
.
In addition to the raw form, there is a higher-level formatted IR that aims to
be human readable. This is used in most check
tests to validate the output,
and also expected to be used regularly by toolchain developers to inspect the
result of checking the parse tree.
The driver will print this when passed --dump-sem-ir
.
Unlike the raw form, certain representational choices in the SemIR
data may
not be visible in this form. However, it is intended to be possible to parse the
SemIR
output and form an equivalent – but not necessarily identical – SemIR
representation, although no such parser currently exists.
As an example, given the program:
fn Cond() -> bool;
fn Run() -> i32 { return if Cond() then 1 else 2; }
The formatted IR is currently:
constants {
%.1: i32 = int_literal 1 [template]
%.2: i32 = int_literal 2 [template]
}
file {
package: <namespace> = namespace [template] {
.Cond = %Cond
.Run = %Run
}
%Cond: <function> = fn_decl @Cond [template] {
%return.var.loc1: ref bool = var <return slot>
}
%Run: <function> = fn_decl @Run [template] {
%return.var.loc2: ref i32 = var <return slot>
}
}
fn @Cond() -> bool;
fn @Run() -> i32 {
!entry:
%Cond.ref: <function> = name_ref Cond, file.%Cond [template = file.%Cond]
%.loc2_33.1: init bool = call %Cond.ref()
%.loc2_26.1: bool = value_of_initializer %.loc2_33.1
%.loc2_33.2: bool = converted %.loc2_33.1, %.loc2_26.1
if %.loc2_33.2 br !if.expr.then else br !if.expr.else
!if.expr.then:
%.loc2_41: i32 = int_literal 1 [template = constants.%.1]
br !if.expr.result(%.loc2_41)
!if.expr.else:
%.loc2_48: i32 = int_literal 2 [template = constants.%.2]
br !if.expr.result(%.loc2_48)
!if.expr.result:
%.loc2_26.2: i32 = block_arg !if.expr.result
return %.loc2_26.2
}
There are three kinds of names in formatted IR, which are distinguished by their leading sigils:
-
%name
denotes a value produced by an instruction. These names are introduced by a line of the form%name: <category> <type> = <instruction>
, and are scoped to the enclosing top-level entity.<category>
describes the expression category, which isinit
for an initializing expression,ref
for a reference expression, or omitted for a value expression. Typically, values can only be referenced by instructions that their introduction dominates, but some kinds of instruction might have other rules. Names in thefile
block can be referenced asfile.%<name>
. -
!name
denotes a label, and!name:
appears as a prefix of eachInstBlock
in aFunction
. These names are scoped to their enclosing function, and can be referenced anywhere in that function, but not outside. -
@name
denotes a top-level entity, such as a function, class, or interface. The SemIR view of these entities is flattened, so member functions are treated as top-level entities.
Names in formatted IR are all invented by the formatter, and generally are of
the form <base_name>[.loc<line>[_<col>[.<counter>]]]
where <line>
and
<col>
describe the location of the instruction, and <counter>
is used as a
disambiguator if multiple instructions appear at the same location. Trailing
name components are only included if they are necessary to disambiguate the
name. <base_name>
is a guessed good name for the instruction, often derived
from source-level identifiers, and is empty if no guess was made.
There is usually one line in a InstBlock
for each Inst
. You can find the
documentation for the different kinds of instructions in
toolchain/sem_ir/typed_insts.h. For example,
given a formatted SemIR line like:
%N: i32 = assoc_const_decl N [template]
you would look for a struct
definition that uses "assoc_const_decl"
as its
ir_name
. In this case, this is the AssociatedConstantDecl
instruction:
// An associated constant declaration in an interface, such as `let T:! type;`.
struct AssociatedConstantDecl {
static constexpr auto Kind =
InstKind::AssociatedConstantDecl.Define<Parse::NodeId>(
{.ir_name = "assoc_const_decl", .is_lowered = false});
TypeId type_id;
NameId name_id;
};
Since this instruction produces a value, it has a TypeId type_id
field, which
corresponds to the type written between the :
and the =
. In the example
above, that type is i32
. The other arguments to the instruction are written
after the ir_name
-- in this example the name_id
is N
. From this we find
that the instruction corresponds to an associated constant declaration in an
interface like let N:! i32;
.
Instructions producing a constant value, like assoc_const_decl
above, are
followed by their phase, either [symbolic]
or [template]
, and then =
the
value if it is the value of a different instruction.
Instructions that do not produce a value, such as the br
and return
instructions above, omit the leading %name: ... =
prefix, as they cannot be
named by other instructions. These instructions do not have a TypeId type_id
field, like the AdaptDecl
instruction:
// An adapted type declaration in a class, of the form `adapt T;`.
struct AdaptDecl {
static constexpr auto Kind = InstKind::AdaptDecl.Define<Parse::AdaptDeclId>(
{.ir_name = "adapt_decl", .is_lowered = false});
// No type_id; this is not a value.
TypeId adapted_type_id;
};
An adapt SomeClass;
declaration would have the corresponding SemIR formatted
as:
adapt_decl %SomeClass
Some instructions have special argument handling. For example, some invalid
arguments will be omitted. Or an InstBlockId
argument will be rendered inline,
commonly enclosed in braces {
...}
or parens (
...)
. In other cases, the
formatter will combine instructions together to make the IR more readable:
- A terminator sequence in a block, comprising a sequence of
BranchIf
instructions followed by aBranch
orBranchWithArg
instruction, is collapsed into a singleif %cond br !label1 else if ... else br !labelN(%arg)
line. - A struct type, formed by a sequence of
StructTypeField
instructions followed by aStructType
instruction, is collapsed into a singlestruct_type{.field1: %value1, ..., .fieldN: %valueN}
line.
These exceptions may be found in toolchain/sem_ir/formatter.cpp.
Question: Are these too in flux to document at this time?
constants
: TODOimports
: TODOfile
: TODO- entities
- TODO: may be preceded by
extern
. - TODO: may be preceded by
generic
.- These may have an optional
!definition:
section containing the generic'sdefinition_block_id
.
- These may have an optional
fn
: TODO; followed by= "
..."
for builtinsclass
: TODOinterface
: TODOimpl
: TODO
- TODO: may be preceded by
specific
: TODO- body in braces
{
...}
has a bunch of `` => ` assignment lines - The first lines of the body describe the declaration
- If there is a valid definition, there are additional definition
assignments after a
!definition:
line.
- body in braces
The core loop is Check::CheckParseTree
. This loops through the Parse::Tree
and calls a Handle
... function corresponding to the NodeKind
of each node.
Communication between these functions for different nodes working together is
through the Context
object defined in
check/context.h, which stores things in a
collection of stacks. The common pattern is that the children of a node are
processed first. They produce information that is then consumed when processing
the parent node.
One example of this pattern is expressions. Each subexpression outputs SemIR
instructions to compute the value of that subexpression to the current
instruction block, added to the top of the InstBlockStack
stored in the
Context
object. It leaves an instruction id on the top of the
node stack pointing to the instruction that produces the value of
that subexpression. Those are consumed by parent operations, like an
RPN calculator. For
example, the expression 1 * 2 + 3
corresponds to this parse tree:
{kind: 'IntegerLiteral', text: '1'},
{kind: 'IntegerLiteral', text: '2'},
{kind: 'InfixOperator', text: '*', subtree_size: 3},
{kind: 'IntegerLiteral', text: '3'},
{kind: 'InfixOperator', text: '+', subtree_size: 5},
This parse tree is processed by one call to a Handle
function per node:
-
The first node is an integer literal, so the core loop calls
HandleIntegerLiteral
.- It calls
context::AddInstAndPush
to output aSemIR::IntegerLiteral
instruction to the current instruction block, and pushes the parse node along with the instruction id to the node stack.
- It calls
-
The second node is also an integer literal, which outputs a second instruction and pushes another entry onto the node stack.
-
HandleInfixOperator
pops the two entries off of the node stack, outputs any conversion instructions that are needed, and usescontext::AddInstAndPush
to create and push the instruction id representing the output of a multiplication instruction. That multiplication instruction takes the instruction ids it popped off the stack at the beginning as arguments. -
Another integer literal instruction is created for
3
and pushed onto the stack. -
HandleInfixOperator
is called again. It pops the two instruction ids off the stack to use as the arguments to the multiplication instruction it creates and pushes.
In this way, the handle functions coordinate producing their output using the instruction block stack and node block stack from the context.
A similar pattern uses bracketing nodes to support parent nodes that can have a
variable number of children. For example, a return
statement can produce parse
trees following a few different patterns:
-
return;
{kind: 'ReturnStatementStart', text: 'return'}, {kind: 'ReturnStatement', text: ';', subtree_size: 2},
-
return x;
{kind: 'ReturnStatementStart', text: 'return'}, {kind: 'NameExpr', text: 'x'}, {kind: 'ReturnStatement', text: ';', subtree_size: 3},
-
return var;
{kind: 'ReturnStatementStart', text: 'return'}, {kind: 'ReturnVarModifier', text: 'var'}, {kind: 'ReturnStatement', text: ';', subtree_size: 3},
In all three cases, the introducer node ReturnStatementStart
pushes an entry
on the node stack with just the parse node and no id, called a
solo parse node. The handler for the parent ReturnStatement
node can pop and
process entries from the node stack until it finds that solo parse node from
ReturnStatementStart
that indicates it is done.
Another pattern that arises is state is set up by an introducer node, updated by its siblings, and then consumed by the bracketing parent node. FIXME: example
The node stack, defined in check/node_stack.h,
stores pairs of a Parse::Node
and an id. The type of the id is determined by
the NodeKind
of the parse node. It is the default, general-purpose stack used
by Handle
... functions in the check stage. Using a single stack is beneficial
since it improves locality of reference and reduces allocations. However,
additional stacks are used to ensure we never need to search through the stack
to find data -- we always want to be operating on the top of the stack (or a
fixed offset).
The node stack contains any state pushed by siblings of the current
Parse::Node
at the top, and state pushed by siblings of ancestors below. The
boundaries between what is a sibling of the current Parse::Node
versus what is
a sibling of an ancestor are not explicitly determined. Instead, the handler for
the parent node knows how many nodes it must pop from the stack based either on
knowing the fixed number of children for that node kind or popping nodes until
it reaches a bracketing node. The arity or bracketing node kind for each parent
node is documented in parse/node_kind.def.
When each Parse::Node
is evaluated, the SemIR for it is typically immediately
generated as SemIR::Inst
s. To help generate the IR to an appropriate context,
scopes have separate SemIR::InstBlock
s.
Sometimes, nodes will need to have delayed evaluation; for example, an inline
definition of a class member function needs to be evaluated after the class is
fully declared. The SemIR::Inst
s cannot be immediately generated because they
may include name references to the class. We're likely to store a reference to
the relevant Parse::Node
for each definition for re-evaluation after the class
scope completes. This means that nodes in a definition would be traversed twice,
once while determining that they're inline and without full checking or IR
generation, then again with full checking and IR generation.
Templates need to have partial semantic checking when declared, but can't be fully implemented before they're instantiated against a specific type.
We are likely to generate a partial IR for templates, allowing for checking with
the incomplete information in the IR. Instantiation will likely use that IR and
fill in the missing information, but it could also reevaluate the original
Parse::Node
s with the known template state.
Carbon relies on rewrites of code, such as rewriting the destination of an initializer to a specific target object once that object is known.
We have two ways to achieve this. One is to track the IR location of a
placeholder instruction and, if it needs updating, replace it with a "rewrite"
SemIR::Inst
that points to a new SemIR::InstBlock
containing the required IR
and specifying which value is the result of that rewrite. This is expressed in
SemIR as a splice_block
instruction. Another is to track the list of
instructions to be created separately from the node block stack, and merge those
instructions into the current block once we have decided on their contents.
Type expressions are treated like any other expression, and are modeled as
SemIR::Inst
s. The types computed by type expressions are deduplicated,
resulting in a canonical SemIR::TypeId
for each distinct type.
The TypeId
preserves only the identity of the type, not its spelling, and so
printing it will produce a fully-resolved type name, which isn't a great user
experience as it doesn't reflect how the type was written in the source code.
Instead, when printing a type name for use in a diagnostic, we will start with
one of two InstId
s:
- A
InstId
for a type expression that describes the way the type was computed. - A
InstId
for an expression that has the given type.
In the former case, the type is pretty-printed by walking the type expression
and printing it. In the latter case, the type of the expression is reconstructed
based on the form of the expression: for example, to print the type of &x
, we
print the type of x
and append a *
, being careful to take potential
precedence issues into account.
TODO: This requires being able to print the type of, for example,
x.foo[0].bar
, by printing only the desired portion of the type of x
, and
similarly may require handling the case where the type of an expression involves
generic parameters whose arguments are specified by that expression. In effect,
the type computation performed when checking an operation is duplicated into the
type printing logic, but is simpler because errors don't need to be detected.
This approach means we don't need to preserve a fully-sugared type for each expression instruction. Instead, we compute that type when we need to print it.
Each SemIR::Inst
that has an associated type also has an expression category,
which describes how it produces a value of that type. These
SemIR::ExprCategory
values correspond to the Carbon expression categories
defined in proposal
#2006:
This instruction is not an expression instruction, and doesn't have an expression category. This is used for namespaces, control flow instructions, and other constructs that represent some non-expression-level semantics.
This instruction produces a value using the type's value representation. Lowering the instruction will produce an LLVM value using that value representation.
This instruction produces a reference to an object. Lowering will produce a pointer to an object representation.
This instruction represents the initialization of an object. Depending on the initializing representation for the type, the initializing expression instruction will do one of the following:
-
For an in-place initializing representation, the instruction will store a value to the target of the initialization.
-
For a by-copy initializing representation, the instruction will produce an object representation by value that can be stored into the target. This is currently only used in cases where the object representation and the value representation are the same.
-
For a type with no initializing representation, such as an empty struct or tuple, it does neither of the above things.
Regardless of the initializing representation, an initializing expression should
be consumed by another instruction that finishes the initialization. For a
by-copy initialization, this final instruction represents the store into the
target, whereas in the other cases it is only used to track in SemIR how the
initialization was used. When an in-place initializer uses a by-copy initializer
as a subexpression, an initialize_from
instruction is inserted to perform this
final store.
This instruction represents a language construct that doesn't have a single expression category. This is used for struct and tuple literals, where the elements of the literal can have different expression categories. Instructions with a mixed expression category are treated as a special case in conversion, which recurses into the elements of those instructions before performing conversions.
A value binding represents a conversion from a reference expression to the value stored in that expression. There are three important cases here:
-
For types with a by-copy value representation, such as
i32
, a value binding represents a load from the address indicated by the reference expression. -
For types with a by-pointer value representation, such as arrays and large structs and tuples, a value binding implicitly takes the address of the reference expression.
-
For structs and tuples, the value representation is a struct or tuple of the elements' value representations, which is not necessarily the same as a struct or tuple of the elements' object representations. In the case where the value representation is not a copy of, or pointer to, the object representation,
value_binding
instructions are not used, and atuple_value
orstruct_value
instruction is used to construct a value representation instead.value_binding
should still be used in the case where the value and object representation are the same, but this is not yet implemented.
Parse::Tree
errors will typically indicate that checking would error for a
given context. We'll want to be careful about how this is handled, but we'll
likely want to generate diagnostics for valid child nodes, then reduce
diagnostics once invalid nodes are encountered. We should be able to reasonably
abandon generated IR of the valid children when we encounter an invalid parent,
without severe effects on surrounding checks.
For example, an invalid line of code in a function might generate some
incomplete IR in the function's SemIR::InstBlock
, but that IR won't negatively
interfere with checking later valid lines in the same function.
Clang creates an AST as part of compilation. In Carbon, it's something we could do as a step between parsing and checking, possibly replacing the SemIR. It's likely that doing so would be simpler, amongst other possible trade-offs. However, we think the SemIR approach is going to yield higher performance, enough so that it's the chosen approach.