- Abstract
- Problem
- Background
- Proposal
- Details
- Rationale based on Carbon's goals
- Alternatives considered
Adds support for subscripting using the conventional square-bracket syntax, with support for both array-like and slice-like semantics.
Cloned from #1356, which was cloned from #1059.
Carbon needs a convenient way of indexing into objects like arrays and slices.
For purposes of this proposal, a "slice" is an object that refers to some region
of an array, and provides access to it, such as a std::string_view
or
std::span
in C++. Slices have subtly different indexing API semantics than
arrays: indexing into an array can produce an l-value only if the array itself
is an l-value, but indexing into a slice can produce an l-value regardless of
the slice's value category.
Interfaces are Carbon's only static open extension mechanism, so user-defined indexing must be defined in terms of interfaces. In order to support definition checking of generic functions, it must be possible to fully typecheck indexing operations using only the information exposed by those interfaces. And in order to provide coherence, a given indexing expression must execute the same code regardless of how much type information we have about it, so long as we have enough type information to establish that the expression is valid.
The primary problem this proposal needs to solve is how to support both array-like and slice-like indexing semantics while satisfying those requirements.
Carbon will support indexing using the conventional a[i]
subscript syntax.
When a
is an l-value, the result of subscripting will always be an l-value,
but when a
is an r-value, the result can be an l-value or an r-value,
depending on which interface the type implements:
- If subscripting an r-value produces an r-value result, as with an array, the
type should implement
IndexWith
. - If subscripting an r-value produces an l-value result, as with a slice, the
type should implement
IndirectIndexWith
.
IndirectIndexWith
is a subtype of IndexWith
, and subscript expressions are
rewritten to method calls on IndirectIndexWith
if the type is known to
implement that interface, or to method calls on IndexWith
otherwise.
A type can implement at most one of these two interfaces. IndirectIndexWith
provides a final blanket impl
of IndexWith
, which ensures coherence.
One implication of this design is that the requirement of coherence must apply
to value categories as well as types (and for those purposes, "l-value" is a
subcategory of "r-value"). This implies that for any valid expression containing
an r-value sub-expression, replacing that sub-expression with an l-value
expression cannot change the validity or semantics of the outer expression, so
long as the r-value expression has the same type and evaluates to the same
value. This, in turn, ensures that once we know a value implements IndexWith
,
additionally learning that it implements IndirectIndexWith
can't invalidate
code that previously typechecked, or change its semantics.
A subscript expression has the form "lhs [
index ]
". As in C++, this
syntax has the same precedence as .
, ->
, and function calls, and associates
left-to-right with all of them.
Its semantics are defined in terms of the following interfaces:
interface IndexWith(SubscriptType:! Type) {
let ElementType:! Type;
fn At[me: Self](subscript: SubscriptType) -> ElementType;
fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType*;
}
interface IndirectIndexWith(SubscriptType:! Type) {
impl as IndexWith(SubscriptType);
fn Addr[me: Self](subscript: SubscriptType) -> ElementType*;
}
A subscript expression where lhs has type T
and index has type I
is
rewritten based on the value category of lhs and whether T
is known to
implement IndirectIndexWith(I)
:
- If
T
implementsIndirectIndexWith(I)
, the expression is rewritten to "*((
lhs).(IndirectIndexWith(I).Addr)(
index))
". - Otherwise, if lhs is an l-value, the expression is rewritten to "
*((
lhs).(IndexWith(I).Addr)(
index))
". - Otherwise, the expression is rewritten to "
(
lhs).(IndexWith(I).At)(
index)
".
On their own, these rules would oblige IndirectIndexWith
types to define three
methods to support a single syntax. Worse, they would permit those types to
define those methods in inconsistent ways, which would violate coherence. To
avoid those problems, IndirectIndexWith
provides a blanket final impl
for
IndexWith
:
final external impl forall
[SubscriptType:! Type, T:! IndirectIndexWith(SubscriptType)]
T as IndexWith(SubscriptType) {
let ElementType:! Type = T.(IndirectIndexWith(SubscriptType)).ElementType;
fn At[me: Self](subscript: SubscriptType) -> ElementType {
return *(me.(IndirectIndexWith(SubscriptType).Addr)(index));
}
fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType* {
return me->(IndirectIndexWith(SubscriptType).Addr)(index);
}
}
Thus, a type that implements IndirectIndexWith
need not, and cannot, provide
its own definitions of IndexWith.At
and IndexWith.Addr
.
An array type could implement subscripting like so:
class Array(template T:! Type) {
external impl as IndexWith(like i64) {
let ElementType:! Type = T;
fn At[me: Self](subscript: i64) -> T;
fn Addr[addr me: Self*](subscript: i64) -> T*;
}
}
And a slice type could look like this:
class Slice(T:! Type) {
external impl as IndirectIndexWith(like i64) {
let ElementType:! Type = T;
fn Addr[me: Self](subscript: i64) -> T*;
}
}
We can even approximate support for indexing on tuple types. We don't have
enough of a design for metaprogramming or variadics to specify it generically,
but we can illustrate how it would look for a particular tuple type, such as
(i8, f64)
:
external impl (i8, f64) as IndexWith(typeof(0)) {
let ElementType:! Type = i8;
fn At[me: Self](0) -> i8;
fn Addr[addr me: Self*](0) -> i8*;
}
external impl (i8, f64) as IndexWith(typeof(1)) {
let ElementType:! Type = f64;
fn At[me: Self](1) -> f64;
fn Addr[addr me: Self*](1) -> f64*;
}
However, this approach presupposes that every integer literal has a distinct type, which is not yet settled. It also only supports indexing with integer literals, not with constant values of ordinary integer types. This is somewhat contrary to Carbon's metaprogramming philosophy: Carbon treats types as values in order to make metaprogramming more like ordinary programming, but here we're doing the reverse, treating values as types. Consequently, the eventual design for tuple-style indexing may look very different.
Indexing is a fundamental operation in procedural programming, and is especially common in performance-critical code, so supporting it with a concise, familiar syntax helps us make code easy to read, understand, and write, and support performance-critical software. Using the same syntax as C++, even for slice-like types, supports interoperability with and migration from existing C++ code.
The reason that slices have different semantics from arrays is that a slice represents an indirection, like a pointer. Indexing behaves the same for slice l-values and r-values for the same reason that dereferencing behaves the same for pointer l-values and r-values. Thus, our attempts to make a single subscripting syntax work for both arrays and slices are arguably analogous to trying to make the same method-call syntax work for both pointers and objects. Or, to put it another way, the difficulty with slices is that they behave like references.
From that point of view, it would be more consistent to have different
subscripting syntaxes for the two cases. The most naive version of that would be
to say that only array-like types support subscripting, and slice-like types
should instead support dereferencing, so indexing into a slice would look like
(*s)[i]
. However, that would be syntactically onerous, and it's not clear how
to define the type of *s
in a way that doesn't beg the question.
Instead, we could define a distinct syntax for slice subscripting, such as
s*[i]
, and define separate interfaces for the two subscript syntax. The two
interfaces would be very similar to IndexWith
and IndirectIndexWith
, but
neither would extend the other, and the rewrite from subscript syntax to method
calls would not depend on type information (although it would depend on value
category). However, any such syntax would be quite novel and unfamiliar, and
there seem to be few good choices for how to spell it. s*[i]
is almost the
only syntax that seems both viable and somewhat mnemonic, but it would add
further complexity to the already-fraught parsing of *
(for humans as well as
tools).
This proposal does not support multiple comma-separated indices (such as
a[i, j]
), which is desirable for types like multidimensional arrays. We do
support syntax like a[(i, j)]
, which is a single index whose type is a tuple,
but the extra parens are syntactically noisy. We could add those parens
implicitly, but that would effectively move the syntactic noise to the
implementation, even for the single-index case (for example
impl Foo as IndexWith((i64,))
). It should be much cleaner to make the
interfaces and their methods variadic, once we have a design for variadics.
This proposal does not provide an obvious path for supporting types like C++'s
std::string_view
or std::span<const T>
, whose subscripting operations expose
read-only access to their contents. It is tempting to try to extend this
proposal to support those use cases, both because of their inherent importance
and because it already has to deal with read-only versus read-write access in
order to support array r-values.
However, there's a fundamental difference between the true immutability of
something like an array r-value, and the contextual lack of mutable access
provided by something like string_view
. While value categories can express the
former, they are not well-suited to expressing the latter. To address these use
cases, Carbon will probably need something like C++'s const
type system, but
that should be largely orthogonal to this proposal.
This proposal does not support subscripting operations that can't produce
l-values. In particular, this means it does not support using subscript syntax
to form slices, as in Python's a[i:j]
or Swift's a[i...j]
. To support this,
we would need a separate pair of interfaces that return by value:
interface RValueIndexWith(SubscriptType:! Type) {
let ElementType:! Type;
fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType;
}
interface RValueIndirectIndexWith(SubscriptType:! Type) {
extends RValueIndexWith(SubscriptType);
let ElementType:! Type;
fn Addr[me: Self](subscript: SubscriptType) -> ElementType;
}
final external impl [SubscriptType:! Type,
T:! RValueIndirectIndexWith(SubscriptType)]
T as IndexWith(SubscriptType) {
let ElementType:! Type =
T.(RValueIndirectIndexWith(SubscriptType)).ElementType;
fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType {
return me->Addr(subscript);
}
}
Note that we still need a pair of interfaces, and a blanket final impl
to
enforce consistency, because arrays and slices have different semantics in this
context as well: taking a slice of an r-value array is invalid, because taking a
slice is equivalent to (and presumably implemented in terms of) taking an
address. On the other hand, taking a slice of an r-value slice is valid and
should be supported.
We would likewise need to extend the rewrite rules for subscript syntax to
detect and use implementations of these interfaces. This should not lead to a
combinatorial explosion of cases, though; if T
implements both IndexWith(I)
and RValueIndexWith(I)
, the program should be ill-formed due to ambiguity.
However, it's not clear if this would provide enough benefit to justify the added complexity.
This proposal does not support subscripting operations that insert new elements
into a collection, as in C++'s std::map
and std::unordered_map
, because it
requires subscriptable types to support subscripting of r-values, which are
immutable. We could support this with an additional interface for types that are
subscriptable only as l-values, and a corresponding extension to the rewrite
rules.
However, it's debatable whether such insertion behavior is desirable; it has not
been a clear-cut success in C++. Code like x = m[i];
looks like it reads from
m
, and the fact that it can write to m
is surprising, and easy for even
experienced programmers to overlook. Even for wary readers, it doesn't convey
the author's intent, because it's not clear whether the author assumed i
is
present, or is relying on the implicit insertion. Furthermore, the implicit
insertion means that x = m[i];
won't compile when m
is const. On the other
hand, while it's relatively unsurprising that m[i] = x;
might insert, that
insertion is also potentially inefficient, since it must default-construct a new
value before assigning x
to it.
We could fix most of these problems by giving special treatment to syntax of the
form m[i] = x;
, and defining separate methods for it:
interface IndexWith(SubscriptType:! Type) {
let ElementType:! Type;
fn At[me: Self](subscript: SubscriptType) -> ElementType;
fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType*;
fn Assign[addr me: Self*](subscript: SubscriptType, element: ElementType) {
(*me->Addr(index)) = element;
}
}
interface IndirectIndexWith(SubscriptType:! Type) {
extends IndexWith(SubscriptType);
let ElementType:! Type;
fn Addr[me: Self](subscript: SubscriptType) -> ElementType*;
fn Assign[me: Self](subscript: SubscriptType, element: ElementType) {
me->Addr(subscript) = element;
}
}
final external impl [SubscriptType:! Type, T:! IndirectIndexWith(SubscriptType)]
T as IndexWith(SubscriptType) {
...
fn Assign[addr me: Self*](subscript: SubscriptType, element: ElementType) {
me->(IndirectIndexWith.AssignAsRValue)(subscript, element);
}
}
With this approach, expressions of the form m[i] = x
would be rewritten to
call the appropriate Assign
method, while all other subscript expressions are
rewritten as in the primary proposal. However, this does add some complexity to
both the interfaces and the rewrite rules (for example, we presumably want
(m[i]) = x
to be treated the same as m[i] = x
), so I'm leaving this as a
potential future extension.
As an alternative way of getting std::map
-like behavior, we could rewrite
subscript expressions to use At
instead of Addr
when the usage context only
needs an r-value, even if the operand is an l-value, in much the same way that
Rust rewrites subscript expressions depending on whether the usage expects the
result to be mutable. This would enable type authors to make m[i] = x;
potentially-inserting while ensuring that x = m[i];
is not, by making Addr
but not At
potentially-inserting.
However, this would mean that m[i].Method()
is potentially inserting if
Method
takes me
by pointer, and would have the same performance drawback as
in C++. Furthermore, it would violate the principle (which we have discussed but
not yet adopted) that type information should flow from subexpressions to the
parent expression, but never the reverse.
A variant of this approach would be to allow the compiler to choose between At
and Addr
when the operand is an l-value but the usage context only needs an
r-value. This might enable the compiler to make the generated code more
efficient and/or safer, and might create pressure for libraries to ensure At
and Addr
are consistent with each other.
However, if the compiler predictably chooses At
under those conditions, that
could create a temptation for library authors to leverage that implementation
detail to try to implement a std::map
-like API, by having Addr
implicitly
insert while At
does not. That would have all the same drawbacks as if we
supported it intentionally, plus the evolutionary drawbacks of having user code
depend on unsupported implementation details.