Synopsis 9: Data Structures
Created: 13 Sep 2004
Last Modified: 16 Oct 2015
Version: 52
This synopsis summarizes the non-existent Apocalypse 9, which discussed in detail the design of Perl 6 data structures. It was primarily a discussion of how the existing features of Perl 6 combine to make it easier for the PDL folks to write numeric Perl.
All list contexts are lazy by default. They might still flatten eventually, but only when forced to. You have to use the eager list operator to get a non-lazy list context, and you have to use the flat operator to guarantee flattening. However, such context is generally provided by the eventual destination anyway, so you don't usually need to be explicit.
Sized low-level types are named most generally by appending the number of bits to a generic low-level type name:
int1
int2
int4
int8
int16
int32 (aka int on 32-bit machines)
int64 (aka int on 64-bit machines)
int128 (aka int on 128-bit machines)
uint1 (aka bit)
uint2
uint4
uint8 (aka byte)
uint16
uint32
uint64
uint128
num16
num32
num64 (aka num on most architectures)
num128
complex16
complex32
complex64 (aka complex on most architectures)
complex128
rat8
rat16
rat32
rat64
rat128
buf8 aka buf, a "normal" byte buffer
buf16, a uint16 buffer
buf32, a uint32 buffer
buf64, a uint64 buffer
Complex sizes indicate the size of each num component rather than the total. This would extend to tensor typenames as well if they're built-in types. Of course, the typical tensor structure is just reflected in the dimensions of the array--but the principle still holds that the name is based on the number of bits of the simple base type.
The unsized types int and num are based on the architecture's normal size for int and double in whatever version of C the run-time system is compiled in. So int typically means int32 or int64, while num usually means num64, and complex means two of whatever num turns out to be. For symmetry around the decimal point, native rats have a numerator that is twice the size of their denominator, such that a rat32 actually has an int64 for its numerator. Custom rational types may be created by instantiating the Rational role with two types; if both types used are native types, the resulting type is considered a native type.
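For instance, a custom rational type with explicitly sized components might be declared along these lines (a hedged sketch; the type name is illustrative, and whether an implementation accepts native types as Rational parameters, and treats the result as native, is up to that implementation):

my constant rat16x16 = Rational[ int16, int16 ];   # illustrative custom type
my rat16x16 $ratio = rat16x16.new(22, 7);
say $ratio;                                        # prints something close to 3.142857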
You are, of course, free to use macros or type declarations to associate additional names, such as "short" or "single". These are not provided by default. An implementation of Perl is not required to support 64-bit integer types or 128-bit floating-point types unless the underlying architecture supports them. 16-bit floating-point is also considered optional in this sense.
And yes, an int1 can store only -1 or 0. I'm sure someone'll think of a use for it...
Note that these are primarily intended to represent storage types; the compiler is generally free to keep all intermediate results in wider types in the absence of declarations or explicit casts to the contrary. Attempts to store an intermediate result in a location that cannot hold it will generally produce a warning on overflow. Underflow may also warn depending on the pragmatic context and use of explicit rounding operators. The default rounding mode from Num to Int is to truncate the fractional part without warning. (Note that warnings are by definition resumable exceptions; however, an exception handler is free to either transform such a warning into a fatal exception or ignore it completely.)
An explicit cast to a storage type has the same potential to throw an exception as the actual attempt to store to such a storage location would.
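As a small illustration of these rules (the exact warning behavior is implementation-dependent):

my num $n = 3.7e0;
my int $i = $n;          # default Num-to-Int rounding: truncates to 3
my int8 $small = 130;    # exceeds the int8 range, so may warn on overflow
say $i;                  # 3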
With IEEE floating-point types, we have a bias towards the use of in-band +Inf, -Inf, and NaN values in preference to throwing an exception, since this is construed as friendlier to vector processing and pipelining. Object types such as Num and Int may store additional information about the nature of the failure, perhaps as an unthrown exception or warning.
A class whose attributes are all low-level value types can behave as a struct. (Access from outside the class is still only through accessors, though, except when the address of a serialized version of the object is used or generated for interfacing to C-like languages.) Whether such a class is actually stored compactly is up to the implementation, but it ought to behave that way, at least to the extent that it's trivially easy (from the user's perspective) to read and write to the equivalent C structure. That is, when serialized or deserialized to the C view, it should look like the C struct, even if that's not how it's actually represented inside the class. (This is to be construed as a substitute for at least some of the current uses of pack/unpack.) Of course, a lazy implementation will probably find it easiest just to keep the object in its serialized form all the time. In particular, an array of compact structs must be stored in their serialized form (see next section).
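As a sketch of the idea (class and attribute names are purely illustrative), a class whose attributes are all native types could be declared and used this way:

class Point3D {
    has num64 $.x;
    has num64 $.y;
    has num64 $.z;
}
my Point3D @cloud[1_000];   # may be stored as a compact array of structs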
For types that exist in the C programming language, the serialized mapping in memory should follow the same alignment and padding rules by default. Integers smaller than a byte are packed into a power-of-two number of bits, so a byte holds four 2-bit integers. Datum sizes that are not a power of two bits are not supported unless declared by the user with sufficient information to determine how to lay them out in memory, possibly with a pack/unpack format associated with the class, or with the strange elements of the class, or with the types under which the strange element is declared.
Note that a compact struct that has no mutators is itself a value type, so except for performance considerations, it doesn't matter how many representations of it there are in memory as long as those are consistent. On the other hand, structs with mutators must behave more like normal mutable objects.
The packing serialization is performed by coercion to an appropriate buffer type. The unpacking is performed by coercion of such a buffer type back to the type of the compact struct.
Standard array indices are specified using square brackets. Standard indices always start at zero in each dimension of the array (see "Multidimensional arrays"), and are always contiguous:
@dwarves[0] = "Happy"; # The 1st dwarf
@dwarves[6] = "Doc"; # The 7th dwarf
@seasons[0] = "Spring"; # The 1st season
@seasons[2] = "Autumn"|"Fall"; # The 3rd season
A basic array declaration like:
my @array;
declares a one-dimensional array of indeterminate length. Such arrays are autoextending. For many purposes, though, it's useful to define array types of a particular size and shape that, instead of autoextending, fail if you try to access outside their declared dimensionality. Such arrays tend to be faster to allocate and access as well. (The language must, however, continue to protect you against overflow--these days, that's not just a reliability issue, but also a security issue.)
To declare an array of fixed size, specify its maximum number of elements in square brackets immediately after its name:
my @dwarves[7]; # Valid indices are 0..6
my @seasons[4]; # Valid indices are 0..3
No intervening whitespace is permitted between the name and the size specification, but "unspace" is allowed:
my @values[10]; # Okay
my @keys [10]; # Error
my @keys\ [10]; # Okay
Note that the square brackets are a compile-time declarator, not a run-time operator, so you can't use the "dotted" form either:
my @values.[10]; # An indexing, not a fixed-size declaration
my @keys\ .[10]; # Ditto
Attempting to access an index outside an array's defined range will fail:
@dwarves[7] = 'Sneaky'; # Fails with "invalid index" exception
However, it is legal for a range or sequence iterator to extend beyond the end of an array as long as its min value is a valid subscript; when used as an rvalue, the range is truncated as necessary to map only valid locations. (When used as an lvalue, any non-existent subscripts generate WHENCE proxies that can receive new values and autovivify anything that needs it.)
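For example, following the truncation rule just described:

my @dwarves[7] = <Happy Sneezy Dopey Grumpy Sleepy Bashful Doc>;
say @dwarves[4..10];    # rvalue: range truncated to 4..6
say @dwarves[9..10];    # fails: the range's minimum index is not valid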
It's also possible to explicitly specify a normal autoextending array:
my @vices[*]; # Length is: "whatever"
# Valid indices are 0..*
For subscripts containing range or sequence iterators extending beyond the end of autoextending arrays, the range is truncated to the actual current size of the array rather than the declared size of that dimension. It is allowed for such a range to start one after the end, so that @array[0..*] merely returns () if @array happens to be empty. However, @array[1..*] would fail because the range's min is too big.
Note that these rules mean it doesn't matter whether you say
@array[*]
@array[0 .. *]
@array[0 .. *-1]
because they all end up meaning the same thing.
There is no autotruncation on the left end. It's not that hard to write 0, and standard indexes always start there.
Subscript size declarations may add a named :map argument supplying a closure, indicating that all index values are to be mapped through that closure. For example, a subscript may be declared as cyclical:
my @seasons[4, :map( * % 4 )];
my @seasons[4, :map{ $_ % 4 }]; # same thing
In this case, all numeric values are taken modulo 4, and no range truncation can ever happen. If you say
@seasons[-4..7] = 'a' .. 'l';
then each element is written three times and the array ends up with ['i','j','k','l']. The mapping function is allowed to return fractional values; the index will be the floor of that value. (It is still illegal to use a numeric index less than 0.) One could map indexes logarithmically, for instance, as long as the numbers aren't so small they produce negative indices.
Another use might be to map positive numbers to even slots and negative numbers to odd slots, so you get indices that are symmetric around 0 (though Perl is not going to track the max-used even and odd slots for you when the data isn't symmetric).
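A hedged sketch of that even/odd mapping, using the :map form shown above:

# non-negative indices go to even slots, negative indices to odd slots
my @symmetric[*, :map( { $_ >= 0 ?? 2 * $_ !! -2 * $_ - 1 } )];
@symmetric[-2 .. 2] = 'a' .. 'e';
# -2 -> slot 3, -1 -> slot 1, 0 -> slot 0, 1 -> slot 2, 2 -> slot 4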
The type of value stored in each element of the array (normally Any for unspecified type) can be explicitly specified too, as an external of type:
my num @nums; # Each element stores a native number
my @nums of num; # Same
my Book @library[1_000_000]; # Each element stores a Book object
my @library[1_000_000] of Book; # Same
Alternatively, the element storage type may be specified as part of the dimension specifier (much like a subroutine definition):
my @nums[-->num];
my @library[1_000_000 --> Book];
In declarations of the form:
my bit @bits;
my int @ints;
my num @nums;
my int4 @nybbles;
my buf @buffers;
my complex128 @longdoublecomplex;
my Array @ragged2d;
the presence of a low-level type tells Perl that it is free to implement the array with "compact storage", that is, with a chunk of memory containing contiguous (or as contiguous as practical) elements of the specified type without any fancy object boxing that typically applies to undifferentiated scalars. (Perl tries really hard to make these elements look like objects when you treat them like objects--this is called autoboxing.)
Unless explicitly declared to be of fixed size, such arrays are autoextending just like ordinary Perl arrays (at the price of occasionally copying the block of data to another memory location, or using a tree structure).
A compact array is for most purposes interchangeable with the corresponding buffer type. For example, apart from the sigil, these are equivalent declarations:
my uint8 @buffer;
my buf8 $buffer;
(Note: If you actually said both of those, you'd still get two different names, since the sigil is part of the name.)
So given @buffer you can say
$piece = substr(@buffer, $beg, $end - $beg);
and given $buffer you can also say
@pieces = $buffer[$n ..^ $end];
Note that subscripting still pulls the elements out as numbers, but substr() returns a buffer of the same type.
For types that exist in the C programming language, the mapping in memory should follow the same alignment rules, at least in the absence of any declaration to the contrary. For interfacing to C pointer types, any buffer type may be used for its memory pointer; note, however, that the buffer knows its length, while in C that length typically must be passed as a separate argument, so the C interfacing code needs to support this whenever possible, lest Perl inherit all the buffer overrun bugs bequeathed on us by C. Random C pointers should never be converted to buffers unless the length is also known. (Any call to strlen() should generally be considered a security hole.) The size of any buffer type in bytes may be found with the .bytes method, even if the type of the buffer elements is not byte. (Strings may be asked for their size in bytes only if they support a buffer type as their minimum abstraction level, hopefully with a known encoding. Otherwise you must encode them explicitly from the higher-level abstraction into some buffer type.)
Perl 6 arrays are not restricted to being one-dimensional (that's simply the default). To declare a multidimensional array, you specify it with a semicolon-separated list of dimension lengths:
my int @ints[4;2]; # Valid indices are 0..3 ; 0..1
my @calendar[12;31;24]; # Valid indices are 0..11 ; 0..30 ; 0..23
Arrays may also be defined with a mixture of fixed and autoextending dimensions. For example, there are always 12 months in a year and 24 hours in a day, but the number of days in the month can vary:
my @calendar[12;*;24]; # day-of-month dimension unlimited/ragged
You can pass a slice (of any dimensionality) for the shape as well:
@shape = 4, 2;
my int @ints[ ||@shape ];
The prefix:<||> operator interpolates a list into a semicolon list at the semicolon level.
The shape in the declaration merely specifies how the array will autovivify on first use, but ends up as an attribute of the actual container object thereby. On the other hand, the shape may be also supplied entirely by an explicit constructor at run-time:
my num @nums = Array of num.new(:shape(3;3;3));
my num @nums .=new():shape(3;3;3); # same thing
A multidimensional array is indexed by a semicolon list (or "semilist"), which is really a list of lists in disguise. Each sublist is a slice of one particular dimension. So:
@array[0..10; 42; @x]
is really short for something like:
@array.postcircumfix:<[ ]>( (0..10), (42), (@x) );
The method's internal **@slices parameter turns the subscripts into three independent lists, which can be read lazily independently of one another. (Though a subscripter will typically use them left-to-right as it slices each dimension in turn.)
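A user-defined container might therefore accept a multidimensional subscript with a signature along these lines (a sketch only; the class and its internal storage are purely illustrative):

class Grid2D {
    has %!cells;
    method postcircumfix:<[ ]> (**@slices) {
        my @rows := @slices[0];    # slice list for the first dimension
        my @cols := @slices[1];    # slice list for the second dimension
        gather for @rows X @cols -> ($r, $c) {
            take %!cells{"$r;$c"};
        }
    }
}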
Note that:
@array[@x,@y]
is always interpreted as a one-dimensional slice in the outermost dimension, which is the same as:
@array[@x,@y;]
or more verbosely:
@array.postcircumfix:<[ ]>( ((@x,@y)) );
To interpolate an array at the semicolon level rather than the comma level, use the prefix:<||> operator:
@array[||@x]
which is equivalent to
@array.postcircumfix:<[ ]>( ((@x[0]), (@x[1]), (@x[2]), etc.) );
Any dimension of the array may be declared as "*", in which case that dimension will autoextend. Typically this would be used in the final dimension to make a ragged array functionally equivalent to an array of arrays:
my int @ints[42; *]; # Second dimension unlimited/ragged
push(@ints[41], getsomeints());
but any dimension of an array may be declared as autoextending:
my @calendar[12;*;24]; # day-of-month dimension unlimited/ragged
@calendar[1;42;8] = 'meeting'; # See you on January 42nd
It is also possible to specify that an array has an arbitrary number of dimensions, using a "hyperwhatever" (**) at the end of the dimensional specification:
my @grid[**]; # Any number of dimensions
my @spacetime[*;*;*;**]; # Three or more dimensions
my @coordinates[100;100;100;**]; # Three or more dimensions
Note that ** is a shorthand for something that means ||(* xx *), so the extra dimensions are all of arbitrary size. To specify an arbitrary number of fixed-size dimensions, write:
my @coordinates[ ||(100 xx *) ];
This syntax is also convenient if you need to define a large number of consistently sized dimensions:
my @string_theory[ ||(100 xx 11) ]; # 11-dimensional
Any array may also be given a second set of user-defined indices, which need not be zero-based, monotonic, or even integers. Whereas standard array indices always start at zero, user-defined indices may start at any finite value of any enumerable type. Standard indices are always contiguous, but user-defined indices need only be distinct and in an enumerable sequence.
To define a set of user-defined indices, specify an explicit or enumerable list of the indices of each dimension (or the name of an enumerable type) in a set of curly braces immediately after the array name:
my @dwarves{ 1..7 };
my @seasons{ <Spring Summer Autumn Winter> };
my enum Months
«:Jan(1) Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec»;
my @calendar{ Months; 1..31; 9..12,14..17 }; # Business hours only
Array look-ups via user-defined indices are likewise specified in curly braces instead of square brackets:
@dwarves{7} = "Doc"; # The 7th dwarf
say @calendar{Jan;13;10}; # Jan 13th, 10am
User-defined indices merely provide a second, non-standard "view" of the array; the underlying container remains the same. Each user-defined index in each dimension is mapped one-to-one back to the standard (zero-based) indices of that dimension. So, given the preceding definitions:
maps to
@dwarves{1} ------> @dwarves[0]
@dwarves{2} ------> @dwarves[1]
: :
@dwarves{7} ------> @dwarves[6]
and:
maps to
@seasons{'Spring'} ------> @seasons[0]
@seasons{'Summer'} ------> @seasons[1]
@seasons{'Autumn'} ------> @seasons[2]
@seasons{'Winter'} ------> @seasons[3]
@seasons<Spring> ------> @seasons[0]
@seasons<Summer> ------> @seasons[1]
@seasons<Autumn> ------> @seasons[2]
@seasons<Winter> ------> @seasons[3]
and:
maps to
@calendar{Jan;1;9} ------> @calendar[0;0;0]
@calendar{Jan;1;10} ------> @calendar[0;0;1]
: :
@calendar{Jan;1;12} ------> @calendar[0;0;3]
@calendar{Jan;1;14} ------> @calendar[0;0;4]
: :
@calendar{Feb;1;9} ------> @calendar[1;0;0]
: :
@calendar{Dec;31;17} ------> @calendar[11;30;7]
User-defined indices can be open-ended, but only on the upper end (i.e. just like standard indices). That is, you can specify:
my @sins{7..*}; # Indices are: 7, 8, 9, etc.
but not:
my @virtue{*..6};
my @celebs{*};
These last two are not allowed because there is no first index, and hence no way to map the infinity of negative user-defined indices back to the standard zero-based indexing scheme.
Declaring a set of user-defined indices implicitly declares the array's standard indices as well (which are still zero-based in each dimension). Such arrays can be accessed using either notation. The standard indices provide an easy way of referring to "ordinal" positions, independent of user-specified indices:
say "The first sin was @sins[0]";
# First element, no matter what @sins's user-defined indexes are
Note that if an array is defined with fixed indices (either standard or user-defined), any attempt to use an index that wasn't specified in the definition will fail. For example:
my @values{2,3,5,7,11}; # Also has standard indices: 0..4
say @values[-1]; # Fails (not a valid standard index)
say @values{1}; # Fails (not a valid user-defined index)
say @values{4}; # Fails (not a valid user-defined index)
say @values[5]; # Fails (not a valid standard index)
say @values{13}; # Fails (not a valid user-defined index)
Furthermore, if an array wasn't specified with user-defined indices, any attempt to index it via .{} will fail:
my @dwarves[7]; # No user-defined indices;
say @dwarves{1}; # Fails: can't map .{1} to a standard .[] index
When :k, :kv, or :p is applied to an array slice, it returns the kind of indices that were used to produce the slice:
@arr[0..2]:p # 0=>'one', 1=>'two', 2=>'three'
@arr{1,3,5}:p # 1=>'one', 3=>'two', 5=>'three'
Adverbs may be applied only to operators, not to terms, so :k, :kv, and :p may not be applied to a full array. However, you may apply an adverb to a Zen slice, which can indicate which set of keys is desired:
my @arr{1,3,5,7,9} = <one two three four five>;
say @arr[]:k; # 0, 1, 2, 3, 4
say @arr{}:k; # 1, 3, 5, 7, 9
The .keys method also returns the keys of all existing elements. For a multidimensional array each key is a list containing one value for each dimension.
The .shape method also works on such an array; it returns a slice of the valid keys for each dimension. The component list representing an infinite dimension is necessarily represented lazily. (Note that the .shape method returns the possible keys, but the cartesian product of the key slice dimensions is not guaranteed to index existing elements in every case. That is, this is not intended to reflect current combinations of keys in use; use :k for that.) Note that you have to distinguish these two forms:
@array[].shape # the integer indices
@array{}.shape # the user-defined indices
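So, with the @calendar declaration above, one might iterate the keys of its existing elements like this (a sketch following the description of .keys above):

for @calendar[].keys -> ($month, $day, $hour) {
    say "element at standard index $month;$day;$hour";
}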
Within any array look-up (whether via .[] or .{}), the "whatever star" can be used to indicate "all the indices". The meaning of "all" here depends on the definition of the array. If there are no pre-specified indices, the star means "all the indices of currently allocated elements":
my @data # No pre-specified indices
= 21, 43, 9, 11; # Four elements allocated
say @data[*]; # So same as: say @data[0..3]
@data[5] = 101; # Now six elements allocated
say @data[*]; # So same as: say @data[0..5]
If the array is defined with predeclared fixed indices (either standard or user-defined), the star means "all the defined indices":
my @results{1,3...99} # Pre-specified indices
= 42, 86, 99, 1;
say @results[*]; # Same as: say @results[0..49]
say @results{*}; # Same as: say @results{1,3...99}
You can omit unallocated elements, either by using the :v adverb:
say @results[*]:v; # Same as: say @results[0..3]
say @results{*}:v; # Same as: say @results{1,3,5,7}
or by using a "zen slice":
say @results[]; # Same as: say @results[0..3]
say @results{}; # Same as: say @results{1,3,5,7}
A "whatever star" can also be used as the starting-point of a range within a slice, in which case it means "from the first index":
say @calendar[*..5]; # Same as: say @calendar[0..5]
say @calendar{*..Jun}; # Same as: say @calendar{Jan..Jun}
say @data[*..3]; # Same as: say @data[0..3]
As the end-point of a range, a lone "whatever" means "to the maximum specified index" (if fixed indices were defined):
say @calendar[5..*]; # Same as: say @calendar[5..11]
say @calendar{Jun..*}; # Same as: say @calendar{Jun..Dec}
or "to the largest allocated index" (if there are no fixed indices):
say @data[1..*]; # Same as: say @data[1..5]
The "whatever star" behaves differently than described above when it is treated as a number inside a standard index. In this case it evaluates to the length of the array. This provides a clean and consistent way to count back or forwards from the end of an array:
@array[*-$N] # $N-th element back from end of array
@array[*+$N] # $N-th element at or after end of array
More specifically:
@array[*-2] # Second-last element of the array
@array[*-1] # Last element of the array
@array[+*] # First element after the end of the array
@array[*+0] # First element after the end of the array
@array[*+1] # Second element after the end of the array
@array[*-3..*-1] # Slice from third-last element to last element
@array[*-3..*] # (Same thing via range truncation)
(Note that, if a particular array dimension has fixed indices, any attempt to index elements after the last defined index will fail, except in the case of range truncation described earlier.)
Negative subscripts are never allowed for standard subscripts unless the subscript is declared modular.
The Perl 6 semantics avoids indexing discontinuities (a source of subtle runtime errors), and provides ordinal access in both directions at both ends of the array.
Occasionally it's convenient to be able to mix standard and user-defined indices in a single look-up.
Within a .[] indexing operation you can use *{$idx} to convert a user-defined index $idx to a standard index. That is:
my @lengths{ Months } = (31,28,31,30,31,30,31,31,30,31,30,31);
@lengths[ 2 .. *{Oct} ] # Same as: @lengths[ 2 .. 9 ]
Similarly, within a .{} indexing operation you can use *[$idx] to convert from standard indices to user-defined:
@lengths{ *[2] .. Oct } # Same as: @lengths{ Mar .. Oct }
In other words, when treated as an array within an indexing operation, * allows you to convert between standard and user-defined indices, by acting like an array of the indices of the indexed array. This is especially useful for mixing standard and user-defined indices within multidimensional array look-ups:
# First three business hours of every day in December...
@calendar{Dec; *; *[0..2]}
# Last three business hours of first three days in July...
@calendar[*{Jul}; 0..2; *-3..*-1]
Extending this feature, you can use ** within an indexing operation as if it were a multidimensional array of all the indices of a fixed number of dimensions of the indexed array:
# Last three business hours of first three days in July...
@calendar{ Jul; **[0..2; *-3..*-1] }
# Same...
@calendar[ **{Jul; 1..3}; *-3..*-1]
It is also possible to stack subscript declarations of various types, including a final normal signature to specify named args and return type:
my @array[10]{'a'..'z'}(:$sparse --> MyType);
[Note: the final signature syntax is merely reserved for now, and not expected to work until we figure out what it really means, if it means anything.]
An array @array can be tied to a PDL at declaration time:
my num @array[||@mytensorshape] is PDL;
my @array is PDL(:shape(2;2;2;2)) of int8;
PDLs are allowed to assume a type of num by default rather than the usual simple scalar. (And in general, the type info is merely made available to the "tie" implementation to do with what it will. Some data structures may ignore the "of" type and just store everything as general scalars. Too bad...)
Arrays by default are one dimensional, but may be declared to have any dimensionality supported by the implementation. You may use arrays just like scalars -- the main caveat is that you have to use binding rather than assignment to set one without copying:
@b := @a[0,2,4 ... *]
With PDLs in particular, this might alias each of the individual elements rather than the array as a whole. So modifications to @b are likely to be reflected back into @a. (But maybe the PDLers will prefer a different notation for that.)
The dimensionality of an array may be declared on the variable, but the actual dimensionality of the array depends on how it was created. Reconciling these views is a job for the particular array implementation. It's not necessarily the case that the declared dimensionality must match the actual dimensionality. It's quite possible that the array variable is deliberately declared with a different dimensionality to provide a different "view" on the actual value:
my int @array[2;2] is Puddle .= new(:shape(4) <== 0,1,2,3);
Again, reconciling those ideas is up to the implementation, Puddle
in this case. The traits system is flexible enough to pass any metadata required, including ideas about sparseness, raggedness, and various forms of non-rectangleness such as triangleness. The implementation should probably carp about any metadata it doesn't recognize though. The implementation is certainly free to reject any object that doesn't conform to the variable's shape requirements.
A subscript indicates a "slice" of an array. Each dimension of an array is sliced separately, so a multidimensional slice subscript may be supplied as a semicolon-separated list of slice sublists. A three-dimensional slice might look like this:
@x[0..10; 1,0; 1,*+2...*]
It is up to the implementation of @x to decide how aggressively or lazily this subscript is evaluated, and whether the slice entails copying. (The PDL folks will generally want it to merely produce a virtual PDL where the new array aliases its values back into the old one.)
Of course, a single element can be selected merely by providing a single index value to each slice list:
@x[0;1;42]
For all multidimensional array types, it is expected that cascaded subscripts:
@x[0][1][42]
@x[0..10][1,0][1,*+2...*]
will either fail or produce the same results as the equivalent semicolon subscripts:
@x[0;1;42]
@x[0..10; 1,0; 1,*+2...*]
Built-in array types are expected to succeed either way, even if the cascaded subscript form must be implemented inefficiently by constructing temporary slice objects for later subscripts to use. (User-defined types may choose not to support the cascaded form, but if so, they should fail rather than providing different semantics.) As a consequence, for built-in types of declared shape, the appropriate number of cascaded subscripts may always be optimized into the semicolon form.
At the statement level, a semicolon terminates the current expression. Within any kind of bracketing construct, semicolon notionally separates the sublists of a multidimensional slice, the interpretation of which depends on the context. Such a semicolon list treats each of its sublists separately, deferring the context of the sublist until it is bound somewhere. The storage of these sublists is hidden in the inner workings of the list. It does not produce a list of lists unless the list as a whole is bound into a slice context.
Single dimensional arrays expect simple slice subscripts, meaning they will treat a list subscript as a slice in the single dimension of the array. Multi-dimensional arrays, on the other hand, know how to handle a multidimensional slice, with one subslice for each dimension. You need not specify all the dimensions; if you don't, the unspecified dimensions are "wildcarded". Supposing you have:
my num @nums[3;3;3];
Then
@nums[0..2]
is the same as
@nums[0..2;]
which is the same as
@nums[0,1,2;*;*]
But you should maybe write the last form anyway just for good documentation, unless you don't actually know how many more dimensions there are. For that case use **:
@nums[0,1,2;**]
If you wanted that 0..2 range to mean
@nums[0;1;2]
it is not good enough to use the | prefix operator, because that interpolates at the comma level, so:
@nums[ |(0,1,2) ]
just means
@nums[ 0,1,2 ];
Instead, to interpolate at the semicolon level, you need to use the || prefix operator:
@nums[ ||(0..2) ]
The zero-dimensional slice:
@x[]
is assumed to want everything, not nothing. It's particularly handy because Perl 6 (unlike Perl 5) won't interpolate a bare array without brackets:
@x = (1,2,3);
say "@x = @x[]"; # prints @x = 1 2 3
Lists are lazy in Perl 6, and the slice lists are no exception. In particular, list generators are not flattened until they need to be, if ever. So a PDL implementation is free to steal the values from these generators and "piddle" around with them:
@nums[$min ..^ $max]
@nums[$min, *+3 ... $max]
@nums[$min, *+3 ... *]
@nums[1,*+2...*] # the odds
@nums[0,*+2...*] # the evens
@nums[1,3...*] # the odds
@nums[0,2...*] # the evens
To rewrite a Perl 5 PDL definition like this:
pp_def(
    'inner',
    Pars => 'a(n); b(n); [o]c(); ',    # the signature, see above
    Code => 'double tmp = 0;
             loop(n) %{ tmp += $a() * $b(); %}
             $c() = tmp;'
);
you might want to write a macro that parses something vaguely resembling this:
role PDL_stuff[::TYPE] {
    PDLsub inner (@a[$n], @b[$n] --> @c[]) {
        my TYPE $tmp = 0;
        for ^$n {
            $tmp += @a[$_] * @b[$_];
        }
        @c[] = $tmp;
    }
}
where that turns into something like this:
role PDL_stuff[::TYPE] {
    multi inner (TYPE @a, TYPE @b --> TYPE) {
        my $n = @a.shape[0];          # or maybe $n is just a parameter
        assert($n == @b.shape[0]);    # and this is already checked by PDL
        my TYPE $tmp = 0;
        for ^$n {
            $tmp += @a[$_] * @b[$_];
        }
        return $tmp;
    }
}
Then any class that does PDL_stuff[num] has an inner() function that can (hopefully) be compiled down to a form useful to the PDL threading engine. Presumably the macro also stores away the PDL signature somewhere safe, since the translated code hides that information down in procedural code. Possibly some of the [n] information can come back into the signature via where constraints on the types. This would presumably make multimethod dispatch possible on similarly typed arrays with differing constraints.
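Such shape constraints might look something like this (purely conjectural, in keeping with the speculation above):

multi inner (num @a where { .shape[0] == 3 }, num @b where { .shape[0] == 3 } --> num) { ... }
multi inner (num @a where { .shape[0] == 4 }, num @b where { .shape[0] == 4 } --> num) { ... }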
(The special destruction problems of Perl 5's PDL should go away with Perl 6's GC approach, as long as PDL's objects are registered with the run-time system correctly.)
A junction is a superposition of data values pretending to be a single data value. Junctions come in four varieties:
list op    infix op
=======    ========
any()      |
all()      &
one()      ^
none()     (no "nor" op defined)
Note that the infix ops are "list-associative", insofar as
$a | $b | $c
$a & $b & $c
$a ^ $b ^ $c
mean
any($a,$b,$c)
all($a,$b,$c)
one($a,$b,$c)
rather than
any(any($a,$b),$c)
all(all($a,$b),$c)
one(one($a,$b),$c)
Some contexts, such as boolean contexts, have special rules for dealing with junctions. In any item context not expecting a junction of values, a junction produces automatic parallelization of the algorithm. In particular, if a junction is used as an argument to any routine (operator, closure, method, etc.), and the scalar parameter you are attempting to bind the argument to is inconsistent with the Junction type, that routine is "autothreaded", meaning the routine will be called automatically as many times as necessary to process the individual scalar elements of the junction in parallel. (Each types are also autothreaded, but are serial and lazy in nature.)
The results of these separate calls are then recombined into a single junction of the same species as the junctive argument. If two or more arguments are junctive, then the argument that is chosen to be "autothreaded" is:
the left-most all or none junction (if any), or else
the left-most one or any junction
with the tests applied in that order.
Each of the resulting set of calls is then recursively autothreaded until no more junctive arguments remain. That is:
substr("camel", 0|1, 2&3)
-> all( substr("camel", 0|1, 2), # autothread the conjunctive arg
substr("camel", 0|1, 3)
)
-> all( any( substr("camel", 0, 2), # autothread the disjunctive arg
substr("camel", 1, 2),
),
any( substr("camel", 0, 3), # autothread the disjunctive arg
substr("camel", 1, 3),
)
)
-> all( any( "ca", # evaluate
"am",
),
any( "cam",
"ame",
)
-> ("ca"|"am") & ("cam"|"ame") # recombine results in junctions
Junctions passed as part of a container do not cause autothreading unless individually pulled out and used as a scalar. It follows that junctions passed as members of a "slurpy" array or hash do not cause autothreading on that parameter. Only individually declared parameters may autothread. (Note that positional array and hash parameters are in fact scalar parameters, though, so you could pass a junction of array or hash objects.)
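For example, a small sketch of this distinction:

sub count-elems (@list) { @list.elems }
sub double ($x) { $x * 2 }
say count-elems([1, 2|3, 4]);   # 3 -- the junction stays inside the array
say double(2|3);                # autothreads the scalar parameter: any(4, 6)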
The exact semantics of autothreading with respect to control structures are subject to change over time; it is therefore erroneous to pass junctions to any control construct that is not implemented as a normal single dispatch or function call. In particular, threading junctions through conditionals correctly could involve continuations, which are almost but not quite mandated in Perl 6.0.0. Alternately, we may decide that boolean contexts always collapse the junction by default, and the exact value that allowed the collapse to "true" is not available. A variant of that is to say that if you want autothreading of a control construct, you must assign or bind to a non-Mu container before the control construct, and that assignment or binding to any such container results in autothreading the rest of the dynamic scope. (The performance ramifications of this are not clear without further experimentation, however.) So for now, please limit use of junctions to situations where the eventual binding to a scalar formal parameter is clear.
[This section is considered conjectural.]
An Each type autothreads like a junction, but does so serially and lazily, and is used only for its mapping capabilities. The prototypical use case is where a hyperoperator would parallelize in an unfortunate way:
@array».say # order not guaranteed
@array.each.say # order guaranteed
Within the scope of a use autoindex pragma (or equivalent, such as use PDL (maybe)), any closure that uses parameters as subscripts is also a candidate for autothreading. For each such parameter, the compiler supplies a default value that is a range of all possible values that subscript can take on (where "possible" is taken to mean the declared shape of a shaped array, or the actual shape of an autoextending array). That is, if you have a closure of the form:
-> $x, $y { @foo[$x;$y] }
then the compiler adds defaults for you, something like:
-> $x = @foo.shape[0].range,
$y = @foo.shape[1].range { @foo[$x;$y] }
where each such range is autoiterated for you.
In the abstract (and often in the concrete), this puts an implicit loop around the block of the closure that visits all the possible subscript values for that dimension (unless the parameter is actually supplied to the closure, in which case the supplied value is used as the slice subscript instead).
This implicit loop is assumed to be parallelizable.
So to write a typical tensor multiplication:
C[i;j;k;l] = A[i;j] * B[k;l]
you can simply call a closure with no arguments, allowing the autoindex pragma to fill in the defaults:
use autoindex;
-> $i, $j, $k, $l { @c[$i; $j; $k; $l] = @a[$i; $j] * @b[$k; $l] }();
or you can use the do BLOCK syntax (see "The do-once loop" in S04) to call that closure, which also implicitly iterates:
use autoindex;
do -> $i, $j, $k, $l {
    @c[$i; $j; $k; $l] = @a[$i; $j] * @b[$k; $l]
}
or even use placeholder variables instead of a parameter list:
use autoindex;
do { @c[$^i; $^j; $^k; $^l] = @a[$^i; $^j] * @b[$^k; $^l] };
That's almost pretty.
It is erroneous for an unbound parameter to match multiple existing array subscripts differently. (Arrays being created don't count.)
Note that you could pass any of $i, $j, $k or $l explicitly, or prebind them with a .assuming method, in which case only the unbound parameters autothread.
If you use an unbound array parameter as a semicolon-list interpolator (via the prefix:<||> operator), it functions as a wildcard list of subscripts that must match the same everywhere that parameter is used. For example,
do -> @wild { @b[ ||@wild.reverse ] = @a[ ||@wild ] };
produces an array with the dimensions reversed regardless of the dimensionality of @a.
The optimizer is, of course, free to optimize away any implicit loops that it can figure out how to do more efficiently without changing the semantics.
See RFC 207 for more ideas on how to use autothreading (though the syntax proposed there is rather different).
Like arrays, you can specify hashes with multiple dimensions and fixed sets of keys:
my num %hash{<a b c d e f>}; # Only valid keys are 'a'..'f'
my num %hash{'a'..'f'}; # Same thing
my %rainfall{ Months; 1..31 }; # Keys: Jan..Dec ; 1..31
Unlike arrays, you can also specify a hash dimension via a non-enumerated type, which then allows all values of that type as keys in that dimension:
my num %hash{<a b c d e f>; Str}; # 2nd dimension key may be any string
my num %hash{'a'..'f'; Str}; # Same thing
my %rainfall{ Months; Int }; # Keys: Jan..Dec ; any integer
To declare a hash that can take any object as a key rather than just a string or integer, say something like:
my %hash{Any};
my %hash{*};
A hash of indeterminate dimensionality is:
my %hash{**};
You can limit the keys to objects of particular types:
my Fight %hash{Dog; Squirrel where {!.scared}};
The standard Hash:
my %hash;
is really short for:
my Mu %hash{Str(Any)};
Note that any type used as a key must be intrinsically immutable, or it has to be able to make a copy that functions as an immutable key, or it has to have copy-on-write semantics. It is erroneous to change a key object's value within the hash except by deleting it and reinserting it.
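For example, an immutable value object can serve directly as a key (a small sketch):

my %first-seen{Any};
my $date = Date.new(2015, 10, 16);   # Date objects are immutable value types
%first-seen{$date} = 'spec freeze';
say %first-seen{$date};              # spec freeze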
The order of hash keys is implementation dependent and arbitrary. Unless %hash is altered in any way, successive calls to .keys, .kv, .pairs, .values, or .iterator will iterate over the elements in the same order.
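For example:

my %h = a => 1, b => 2, c => 3;
say %h.keys;      # arbitrary order, e.g. (b a c)
say %h.values;    # but in the corresponding order: (2 1 3)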
The default hash iterator is a property called .iterator that can be user-replaced. When the hash itself needs an iterator for .pairs, .keys, .values, or .kv, it calls %hash.iterator() to start one. In item context, .iterator returns an iterator object. In list context, it returns a lazy list fed by the iterator. It must be possible for a hash to be in more than one iterator at a time, as long as the iterator state is stored in a lazy list.
The downside to making a hash autosort via the iterator is that you'd have to store all the keys in sorted order, and resort it when the hash changes. Alternately, the entire hash could be tied to an ISAM implementation (not included (XXX or should it be?)).
For multidimensional hashes, the key returned by any hash iterator is a list of keys, the size of which is the number of declared dimensions of the hash. [XXX but this seems to imply another lookup to find the value. Perhaps the associated value can also be bundled in somehow.]
Autovivification will only happen if the vivifiable path is bound to a read-write container. Value extraction (that is, binding to a readonly or copy container) does not autovivify.
Note that assignment is treated the same way as binding to a copy container, so it does not autovivify its right side either.
Any mention of an expression within a Capture (that is, an argument list) delays the autovivification decision to binding time. (Binding to a raw parameter defers the decision even further.)
This is as opposed to Perl 5, where autovivification could happen unintentionally, even when the code looks like a non-destructive test:
# This is Perl 5 code
my %hash;
exists $hash{foo}{bar}; # creates $hash{foo} as an empty hash reference
In Perl 6 these read-only operations are indeed non-destructive:
my %hash;
%hash<foo><bar> :exists; # %hash is still empty
But these bindings do autovivify:
my %hash;
my $val := %hash<foo><bar>;
my @array;
my $capture = \@array[0][0]; # $capture is a Capture object - see S02
my :($obj) := $capture; # @array[0][0] created here
my @array;
foo(@array[0][0]);
sub foo ($obj is rw) {...} # same thing, basically
my %hash;
%hash<foo><bar> = "foo"; # duh
This rule applies to Array, Hash, and any other container type that chooses to return an autovivifiable type object (see S12) rather than simply returning Failure when a lookup fails. Note in particular that, since autovivification is defined in terms of type objects rather than failure, it still works under "use fatal".
This table solidifies the intuition that an operation pertaining to some data structure causes the type object to autovivify to such an object:
operation              autovivifies to
=========              ===============
push, unshift, .[]     Array
.{}                    Hash
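For example:

my %hash;
push %hash<queue>, 'job1';    # %hash<queue> autovivifies to an Array
%hash<conf><debug> = True;    # %hash<conf> autovivifies to a Hash
say %hash.perl;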
In addition to the above data structures autovivifying, ++ and -- will cause an Int to appear, ~= will create a Str, etc.; but these are natural consequences of the operators working on Failure, qualitatively different from autovivifying containers.
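For example:

my %counts;
%counts<apples>++;        # an Int appears: %counts<apples> is now 1
my %log;
%log<msg> ~= 'started';   # a Str appears: %log<msg> is now 'started'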
The type of the type object returned by a non-successful lookup should be identical to the type that would be returned for a successful lookup. The only difference is whether it's officially instantiated (defined) yet. That is, you cannot distinguish them via .WHAT or .HOW, only via .defined.
Binding of an autovivifiable type object to a non-writeable container translates the type object into a similar type object without its autovivifying closure and puts that new type object into the container instead (with any pertinent historical diagnostic information carried over). There is therefore no magical method you can call on the readonly parameter that can magically autovivify the type object after the binding. The newly bound variable merely appears to be a simple uninitialized value. (The original type object retains its closure in case it is rebound elsewhere to a read-write container.)
Some implementation notes: Nested autovivifications work by making nested type objects that depend on each other. In the general case the containers must produce type objects any time they do not know how the container will be bound. This includes when interpolated into any capture that has delayed binding:
\( 1, 2, %hash<foo><bar> ) # must defer
\%hash<foo><bar> # must defer
In specific situations, however, the compiler can know that a value can only be bound readonly. For instance, infix:<+> is prototyped such that this can never autovivify:
%hash<foo><bar> + 42
In such a case, the container object need not go through the agony of calculating an autovivifying closure that will never be called. On the other hand:
%hash<foo><bar> += 42
binds the left side to a mutable container, so it autovivifies.
Assignment doesn't look like binding, but consider that it's really calling some kind of underlying set method on the container, which must be mutable in order to change its contents.
Larry Wall <[email protected]>