Some people have mentioned that memcpy
and memmove
functions are hot
when profiling some WebAssembly benchmarks. Some examples:
I've been looking at perf profiles for wasm unity benchmark a bit recently and see that some of the hottest functions are doing memcpy or memset like things. If this is any indication of normal wasm code patterns, I think we could see significant improvement with an intrinsic so it may be worth prioritizing.
In a number of game engines I've been optimizing and benchmarking, interestingly the performance of memcpy() does show up relatively high in profiles. (~2%-5% of total execution time)
I implemented a prototype implementation of a memory.copy
instruction in v8 which just calls out
to v8's MemMove
function. I compared
this to an implementation generated by emscripten and currently used in the Unity demo. This implementation aligns then performs copies using i32.load
and i32.store
. I've also included performance achieved by unrolling this loop manually and increasing the size to i64
.
Each test copies size
bytes from one address to another, non-overlapping. This is repeated N
times. Each row copies a total of 1 Gib of data, and only touches 1 Mib of memory in the source and destination ranges.
This is the core loop:
let mask = Mib - 1;
let start = performance.now();
for (let i = 0; i < N; ++i) {
f(dst_base + dst, src_base + src, size);
dst = (dst + size) & mask;
src = (src + size) & mask;
}
let end = performance.now();
The code for the benchmark can be found here.
Note that this will not run properly without a WebAssembly implementation of memory.copy
. For my tests, I
hacked a version of v8 to replace any exported function called memcpy
or memmove
with a new function with
the following contents:
(func (param $dst i32) (param $src i32) (param $size i32) (result i32)
local.get $dst
local.get $src
local.get $size
memory.copy
local.get $dst)
Here are the results on my machine (x86_64, 2.9GHz, L1 32k, L2 256k, L3 256k):
intrinsic | i64 load/store x 4 | i64 load/store x 2 | i32 load/store x 2 | i32 load/store | |
---|---|---|---|---|---|
size=32b, N=33554432 | 1.382 Gib/s | 1.565 Gib/s | 1.493 Gib/s | 1.275 Gib/s | 1.166 Gib/s |
size=64b, N=16777216 | 3.285 Gib/s | 2.669 Gib/s | 2.383 Gib/s | 1.861 Gib/s | 1.639 Gib/s |
size=128b, N=8388608 | 6.162 Gib/s | 3.993 Gib/s | 3.480 Gib/s | 2.433 Gib/s | 2.060 Gib/s |
size=256b, N=4194304 | 9.939 Gib/s | 5.323 Gib/s | 4.462 Gib/s | 2.724 Gib/s | 2.213 Gib/s |
size=512b, N=2097152 | 15.777 Gib/s | 6.377 Gib/s | 4.913 Gib/s | 3.231 Gib/s | 2.457 Gib/s |
size=1.0Kib, N=1048576 | 17.902 Gib/s | 7.010 Gib/s | 6.112 Gib/s | 3.568 Gib/s | 2.614 Gib/s |
size=2.0Kib, N=524288 | 19.870 Gib/s | 8.248 Gib/s | 6.915 Gib/s | 3.764 Gib/s | 2.699 Gib/s |
size=4.0Kib, N=262144 | 20.940 Gib/s | 9.145 Gib/s | 7.400 Gib/s | 3.871 Gib/s | 2.729 Gib/s |
size=8.0Kib, N=131072 | 21.162 Gib/s | 9.258 Gib/s | 7.672 Gib/s | 3.925 Gib/s | 2.763 Gib/s |
size=16.0Kib, N=65536 | 20.991 Gib/s | 9.758 Gib/s | 7.756 Gib/s | 3.945 Gib/s | 2.773 Gib/s |
size=32.0Kib, N=32768 | 22.504 Gib/s | 9.956 Gib/s | 7.861 Gib/s | 3.966 Gib/s | 2.780 Gib/s |
size=64.0Kib, N=16384 | 22.534 Gib/s | 10.088 Gib/s | 7.931 Gib/s | 3.974 Gib/s | 2.782 Gib/s |
size=128.0Kib, N=8192 | 29.728 Gib/s | 10.032 Gib/s | 7.934 Gib/s | 3.975 Gib/s | 2.782 Gib/s |
size=256.0Kib, N=4096 | 29.742 Gib/s | 10.116 Gib/s | 7.625 Gib/s | 3.886 Gib/s | 2.781 Gib/s |
size=512.0Kib, N=2048 | 29.994 Gib/s | 10.090 Gib/s | 7.627 Gib/s | 3.985 Gib/s | 2.785 Gib/s |
size=1.0Mib, N=1024 | 11.760 Gib/s | 10.091 Gib/s | 7.959 Gib/s | 3.989 Gib/s | 2.787 Gib/s |
Under the current threading proposal, to share a module between multiple agents, the module must be instantiated multiple times: once per agent. Instantiation initializes linear memory with the contents in the module's data segments. If the memory is shared between multiple agents, it will be initialized multiple times, potentially overwriting stores that occurred after the previous initializations.
For example:
;; The module.
(module
(memory (export "memory") 1)
;; Some value used as a counter.
(data (i32.const 0) "\0")
;; Add one to the counter.
(func (export "addOne")
(i32.store8
(i32.const 0)
(i32.add
(i32.load8_u (i32.const 0))
(i32.const 1)))
)
)
// main.js
let moduleBytes = ...;
WebAssembly.instantiate(moduleBytes).then(
({module, instance}) => {
// Increment our counter.
instance.exports.addOne();
// Spawn a new Worker.
let worker = new Worker('worker.js');
// Send the module to the new Worker.
worker.postMessage(module);
});
// worker.js
function onmessage(event) {
let module = event.data;
// Use the module to create another instance.
WebAssembly.instantiate(module).then(
(instance) => {
// Oops, our counter has been clobbered.
});
}
This can be worked around by storing the data segments in a separate module which is only instantiated once, then exporting this memory to be used by another module that contains only code. This works, but it cumbersome since it requires two modules where one should be enough.
When discussing the design of Conditional Segment Initialization,
we found that programmatic memory initialization from a read-only data segment
(via the memory.init
instruction, described below) has similar behavior to the
proposed instruction to copy memory regions from linear memory (memory.copy
,
also described below.)
Copying between regions in linear memory or a table is accomplished with the
new *.copy
instructions:
memory.copy
: copy from one region of linear memory to anothertable.copy
: copy from one region of a table to another
Filling a memory region can be accomplished with memory.fill
:
memory.fill
: fill a region of linear memory with a given byte value
The binary format for the data section currently has a collection of segments, each of which has a memory index, an initializer expression for its offset, and its raw data.
Since WebAssembly currently does not allow for multiple memories, the memory index of each segment must be zero. We can repurpose this 32-bit integer as a flags field where new meaning is attached to nonzero values.
When the low bit of the new flags field is 1
, this segment is passive. A
passive segment will not be automatically copied into the memory or table on
instantiation, and must instead be applied manually using the following new
instructions:
memory.init
: copy a region from a data segmenttable.init
: copy a region from an element segment
A passive segment has no initializer expression, since it will be specified
as an operand to memory.init
or table.init
.
Segments can also be shrunk to size zero by using the following new instructions:
data.drop
: discard the data in an data segmentelem.drop
: discard the data in an element segment
An active segment is equivalent to a passive segment, but with an implicit
memory.init
followed by a data.drop
(or table.init
followed by a
elem.drop
) that is prepended to the module's start function.
Additionally, the reference-types proposal introduces the notion of a function reference (a function whose address is a program value). To support this, element segments can have several encodings, and can also be used to forward-declare functions whose address will be taken; see below.
The reference-types proposal also introduces the bulk instructions table.fill
and table.grow
, both of which take a function reference as an initializer
argument.
The meaning of the bits of the flag field (a varuint32
) for data segments is:
Bit | Meaning |
---|---|
0 | 0=is active, 1=is passive |
1 | if bit 0 clear: 0=memory 0, 1=has memory index |
which yields this view, with the fields carried by each flag value:
Flags | Meaning | Memory index | Offset in memory | Count | Payload |
---|---|---|---|---|---|
0 | Active | init_expr |
varuint32 |
u8 * |
|
1 | Passive | varuint32 |
u8 * |
||
2 | Active with memory index | varuint32 |
init_expr |
varuint32 |
u8 * |
All other flag values are illegal. At present the memory index must be zero, but the upcoming multi-memory proposal changes that.
The meaning of the bits of the flag field (a varuint32
) for element segments is:
Bit | Meaning |
---|---|
0 | 0=is active, 1=is passive |
1 | if bit 0 clear: 0=table 0, 1=has table index |
if bit 0 set: 0=active, 1=declared | |
2 | 0=carries indicies; 1=carries elemexprs |
which yields this view, with the fields carried by each flag value:
Flag | Meaning | Table index | Offset in table | Encoding | Count | Payload |
---|---|---|---|---|---|---|
0 | Legacy active, funcref externval | init_expr |
varuint32 |
idx * |
||
1 | Passive, externval | extern_kind |
varuint32 |
idx * |
||
2 | Active, externval | varuint32 |
init_expr |
extern_kind |
varuint32 |
idx * |
3 | Declared, externval | extern_kind |
varuint32 |
idx * |
||
4 | Legacy active, funcref elemexpr | init_expr |
varuint32 |
elem_expr * |
||
5 | Passive, elemexpr | elem_type |
varuint32 |
elem_expr * |
||
6 | Active, elemexpr | varuint32 |
init_expr |
elem_type |
varuint32 |
elem_expr * |
7 | Declared, elemexpr | elem_type |
varuint32 |
elem_expr * |
All other flag values are illegal. Note that the "declared" attribute is not used by this proposal, but is used by the reference-types proposal.
The extern_kind
must be zero, signifying a function definition. An idx
is a
varuint32
that references an entity in the module, currently only its function
table.
At present the table index must be zero, but the reference-types proposal introduces a notion of multiple tables.
An elem_expr
is like an init_expr
, but can only contain expressions of the following sequences:
Binary | Text | Description |
---|---|---|
0xd0 0x0b |
ref.null end |
Returns a null reference |
0xd2 varuint32 0x0b |
ref.func $funcidx end |
Returns a reference to function $funcidx |
In the MVP, segments are initialized during module instantiation. If any segment would be initialized out-of-bounds, then the memory or table instance is not modified.
This behavior is changed in the bulk memory proposal:
Each active segment is initialized in module-definition order. For each segment, if reading the source or writing the destination would go out of bounds, then instantiation fails at that point. Data that had already been written for previous (in-bounds) segments stays written.
The memory.init
instruction copies data from a given passive segment into a target
memory. The target memory and source segment are given as immediates.
The instruction has the signature [i32 i32 i32] -> []
. The parameters are, in order:
- top-2: destination address
- top-1: offset into the source segment
- top-0: size of memory region in bytes
It is a validation error to use memory.init
with an out-of-bounds segment index.
A trap occurs if:
- the source offset plus size is greater than the length of the source data segment;
this includes the case that the segment has been dropped via
data.drop
- the destination offset plus size is greater than the length of the target memory
The order of writing is unspecified, though this is currently unobservable.
Note that it is allowed to use memory.init
on the same data segment more than
once.
The data.drop
instruction shrinks the size of the segment to zero. After a
data segment has been dropped, it can still be used in a memory.init
instruction, but only a zero-length access at offset zero will not trap. This
instruction is intended to be used as an optimization hint to the WebAssembly
implementation. After a memory segment is dropped its data can no longer be
retrieved, so the memory used by this segment may be freed.
It is a validation error to use data.drop
with an out-of-bounds segment index.
Copy data from a source memory region to destination region. The regions are said to overlap if they are in the same memory and the start address of one region is one of the addresses that's read or written (by the copy operation) in the other region.
This instruction has two immediate arguments: the source and destination memory indices. They currently both must be zero.
Copying takes place as if an intermediate buffer were used, allowing the destination and source to overlap.
The instruction has the signature [i32 i32 i32] -> []
. The parameters are, in order:
- top-2: destination address
- top-1: source address
- top-0: size of memory region in bytes
A trap occurs if:
- the source offset plus size is greater than the length of the source memory
- the destination offset plus size is greater than the length of the target memory
The bounds check is performed before any data are written.
Set all bytes in a memory region to a given byte. This instruction has an immediate argument of which memory to operate on, and it must be zero for now.
The instruction has the signature [i32 i32 i32] -> []
. The parameters are, in order:
- top-2: destination address
- top-1: byte value to set
- top-0: size of memory region in bytes
A trap occurs if:
- the destination offset plus size is greater than the length of the target memory
The bounds check is performed before any data are written.
The table.*
instructions behave similary to the memory.*
instructions, with
the difference that they operate on element segments and tables, instead of
data segments and memories. The offset and length operands of table.init
and
table.copy
have element units instead of bytes as well.
Consider if there are two data sections, the first is always active and the second is conditionally active if global 0 has a non-zero value. This could be implemented as follows:
(import "a" "global" (global i32)) ;; global 0
(memory 1)
(data (i32.const 0) "hello") ;; data segment 0, is active so always copied
(data "goodbye") ;; data segment 1, is passive
(func $start
(if (global.get 0)
;; copy data segment 1 into memory 0 (the 0 is implicit)
(memory.init 1
(i32.const 16) ;; target offset
(i32.const 0) ;; source offset
(i32.const 7)) ;; length
;; The memory used by this segment is no longer needed, so this segment can
;; be dropped.
(data.drop 1))
)
(start $start)
All bulk memory instructions are encoded as a 0xfc prefix byte, followed by another opcode, optionally followed by more immediates:
instr ::= ...
| 0xfc operation:uint8 ...
Name | Opcode | Immediate | Description |
---|---|---|---|
memory.init |
0xfc 0x08 |
segment:varuint32 , memory:0x00 |
copy from a passive data segment to linear memory |
data.drop |
0xfc 0x09 |
segment:varuint32 |
prevent further use of passive data segment |
memory.copy |
0xfc 0x0a |
memory_dst:0x00 memory_src:0x00 |
copy from one region of linear memory to another region |
memory.fill |
0xfc 0x0b |
memory:0x00 |
fill a region of linear memory with a given byte value |
table.init |
0xfc 0x0c |
segment:varuint32 , table:0x00 |
copy from a passive element segment to a table |
elem.drop |
0xfc 0x0d |
segment:varuint32 |
prevent further use of a passive element segment |
table.copy |
0xfc 0x0e |
table_dst:0x00 table_src:0x00 |
copy from one region of a table to another region |
The WebAssembly binary format is designed to be validated in a single pass. If a section requires information to validate, it is guaranteed that this information will be present in a previous section.
The memory.{init,drop}
instructions break this guarantee. Both of these
instructions are used in the Code
section. They each have a data segment
index immediate, but the vector of data segments is not available until the
Data
section is parsed, which occurs after the Code
section.
To keep single-pass validation, the number of data segments defined in the
Data
section must be available before the Code
section. This information is
provided in a new DataCount
section with the code 12
.
Like all sections, the DataCount
section is optional. If present, it must
appear in the following order:
Section Name | Code | Description |
---|---|---|
Type | 1 |
Function signature declarations |
Import | 2 |
Import declarations |
Function | 3 |
Function declarations |
Table | 4 |
Indirect function table and other tables |
Memory | 5 |
Memory attributes |
Global | 6 |
Global declarations |
Export | 7 |
Exports |
Start | 8 |
Start function declaration |
Element | 9 |
Elements section |
DataCount | 12 |
Data segment count |
Code | 10 |
Function bodies (code) |
Data | 11 |
Data segments |
The DataCount
section has just one field that specifies the number of data
segments in the Data
section:
Field | Type | Description |
---|---|---|
count | varuint32 |
count of data segments in Data section |
The binary is malformed if count
is not equal to the number of data segments
in the Data
section. The binary is also malformed if the DataCount
section
is omitted and a memory.init
or data.drop
instruction is used.