pulley: Reimplement wasm loads/stores & memory opcodes #10154

alexcrichton · 2025-01-29T23:55:15Z

This commit is a large refactoring to reimplement how WebAssembly loads/stores are translated to Pulley opcodes when using the interpreter. Additionally the functionality related to memory support has changed quite a bit with the interpreter as well. This is all based off comments on #10102 with the end goal of folding the two Pulley opcodes today of "do the bounds check" and "do the load" into one opcode. This is intended to reduce the number of opcodes and overall improve interpreter throughput by minimizing turns of the interpreter loop.

The basic idea behind this PR is that a new basic suite of loads/stores are added to Pulley which trap if the address is zero. This provides a route to translate trapping loads/stores in CLIF to Pulley bytecode without actually causing segfaults at runtime. WebAssembly translation to CLIF is then updated to use the select trick for wasm loads/stores where either 0 is loaded from or the actual address is loaded from. Basic support for translation and such is added for this everywhere, and this ensures that all loads/stores for wasm will be translated successfully with Pulley.

The next step was to extend the "g32" addressing mode preexisting in Pulley to support a bounds check as well. New pattern-matches were added to ISLE to search for a bounds check in the address of a trapping load/store. If found then the entire chain of operations necessary to compute the address are folded into a single "g32" opcode which ends up being a fallible load/store at runtime.

To fit all this into Pulley this commit contains a number of refactorings to shuffle around existing opcodes related to memory and extend various pieces of functionality here and there:

Pulley now uses a AddrFoo types to represent addressing modes as a single immediate rather than splitting it up into pieces for each method. For example AddrO32 represents "base + offset32". AddrZ represents the same thing but traps if the address is zero. The AddrG32 mode represents a bounds-checked 32-bit linear memory access on behalf of wasm.
Pulley loads/stores were reduced to always using an AddrFoo immediate. This means that the old offset8 addressing mode was removed without replacement here (to be added in the future if necessary). Additionally the suite of sign-extension modes supported were trimmed down to remove 8-to-64, 16-to-64, and 32-to-64 extensions folded as part of the opcode. These can of course always be re-added later but probably want to be added just for the G32 addressing mode as opposed to all addressing modes.
The interpreter itself was refactored to have an AddressingMode trait to ensure that all memory accesses, regardless of addressing modes, are largely just copy/pastes of each other. In the future it might make sense to implement these methods with a macro, but for now it's copy/paste.
In ISLE the XLoad generic instruction removed its ext field to have extensions handled exclusively in ISLE instead of partly in emit.rs.
Float/vector loads/stores now have "g32" addressing (in addition to the "z" that's required for wasm) since it was easy to add them.
Translation of 1-byte accesses on Pulley from WebAssembly to CLIF no longer has a special case for using a >= b instead of a > b - 1 to ensure that the same bounds-check instruction can be used for all sizes of loads/stores.
The bounds-check which folded a load-of-the-bound into the opcode is now present as a "g32bne" addressing mode. with its of suite of instructions to boo.

Overall this PR is not a 1:1 replacement of all previous opcodes with exactly one opcode. For example loading 8 bits sign-extended to 64-bits is now two opcodes instead of one. Additionally some previous opcodes have expanded in size where for example the 8-bit offset mode was remove in favor of only having 32-bit offsets. The goal of this PR is to reboot how memory is handled in Pulley. All loads/stores now use a specific addressing mode and currently all operations supported across addressing modes are consistently supported. In the future it's expected that some features will be added to some addressing modes and not others as necessary, for example extending the "g32" addressing mode only instead of all addressing modes.

For an evaluation of this PR:

Code size: spidermonkey.cwasm file is reduced from 19M to 16M.
Sightglass: pulldown-cmark is improved by 15%
Sightglass: bz2 is improved by 20%
Sightglass: spidermonkey is improved by 22%
Coremark: score improved by 40%

Overall this PR and new design looks to be a large win. This is all driven by the reduction in opcodes both for compiled code size and execution speed by minimizing turns of the interpreter loop. In the end I'm also pretty happy with how this turned out and I think the refactorings are well worth it.

This commit is a large refactoring to reimplement how WebAssembly loads/stores are translated to Pulley opcodes when using the interpreter. Additionally the functionality related to memory support has changed quite a bit with the interpreter as well. This is all based off comments on bytecodealliance#10102 with the end goal of folding the two Pulley opcodes today of "do the bounds check" and "do the load" into one opcode. This is intended to reduce the number of opcodes and overall improve interpreter throughput by minimizing turns of the interpreter loop. The basic idea behind this PR is that a new basic suite of loads/stores are added to Pulley which trap if the address is zero. This provides a route to translate trapping loads/stores in CLIF to Pulley bytecode without actually causing segfaults at runtime. WebAssembly translation to CLIF is then updated to use the `select` trick for wasm loads/stores where either 0 is loaded from or the actual address is loaded from. Basic support for translation and such is added for this everywhere, and this ensures that all loads/stores for wasm will be translated successfully with Pulley. The next step was to extend the "g32" addressing mode preexisting in Pulley to support a bounds check as well. New pattern-matches were added to ISLE to search for a bounds check in the address of a trapping load/store. If found then the entire chain of operations necessary to compute the address are folded into a single "g32" opcode which ends up being a fallible load/store at runtime. To fit all this into Pulley this commit contains a number of refactorings to shuffle around existing opcodes related to memory and extend various pieces of functionality here and there: * Pulley now uses a `AddrFoo` types to represent addressing modes as a single immediate rather than splitting it up into pieces for each method. For example `AddrO32` represents "base + offset32". `AddrZ` represents the same thing but traps if the address is zero. The `AddrG32` mode represents a bounds-checked 32-bit linear memory access on behalf of wasm. * Pulley loads/stores were reduced to always using an `AddrFoo` immediate. This means that the old `offset8` addressing mode was removed without replacement here (to be added in the future if necessary). Additionally the suite of sign-extension modes supported were trimmed down to remove 8-to-64, 16-to-64, and 32-to-64 extensions folded as part of the opcode. These can of course always be re-added later but probably want to be added just for the `G32` addressing mode as opposed to all addressing modes. * The interpreter itself was refactored to have an `AddressingMode` trait to ensure that all memory accesses, regardless of addressing modes, are largely just copy/pastes of each other. In the future it might make sense to implement these methods with a macro, but for now it's copy/paste. * In ISLE the `XLoad` generic instruction removed its `ext` field to have extensions handled exclusively in ISLE instead of partly in `emit.rs`. * Float/vector loads/stores now have "g32" addressing (in addition to the "z" that's required for wasm) since it was easy to add them. * Translation of 1-byte accesses on Pulley from WebAssembly to CLIF no longer has a special case for using `a >= b` instead of `a > b - 1` to ensure that the same bounds-check instruction can be used for all sizes of loads/stores. * The bounds-check which folded a load-of-the-bound into the opcode is now present as a "g32bne" addressing mode. with its of suite of instructions to boo. Overall this PR is not a 1:1 replacement of all previous opcodes with exactly one opcode. For example loading 8 bits sign-extended to 64-bits is now two opcodes instead of one. Additionally some previous opcodes have expanded in size where for example the 8-bit offset mode was remove in favor of only having 32-bit offsets. The goal of this PR is to reboot how memory is handled in Pulley. All loads/stores now use a specific addressing mode and currently all operations supported across addressing modes are consistently supported. In the future it's expected that some features will be added to some addressing modes and not others as necessary, for example extending the "g32" addressing mode only instead of all addressing modes. For an evaluation of this PR: * Code size: `spidermonkey.cwasm` file is reduced from 19M to 16M. * Sightglass: `pulldown-cmark` is improved by 15% * Sightglass: `bz2` is improved by 20% * Sightglass: `spidermonkey` is improved by 22% * Coremark: score improved by 40% Overall this PR and new design looks to be a large win. This is all driven by the reduction in opcodes both for compiled code size and execution speed by minimizing turns of the interpreter loop. In the end I'm also pretty happy with how this turned out and I think the refactorings are well worth it.

alexcrichton · 2025-01-29T23:56:26Z

I'll note that one thing I'm having difficulty adding a disas test for is something which generates a "g32bne" lowering. I can see the instructions getting emitted in spidermonkey.cwasm but I am having a difficult time reducing things down to something that can be added as an example. Most loads aren't "sinkable" which is why it's not used super often at least.

github-actions · 2025-01-30T02:06:13Z

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "cranelift:meta", "isle", "pulley", "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

cfallin: isle
fitzgen: isle, pulley

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

fitzgen

LGTM but some questions that should be resolved below.

fitzgen · 2025-02-04T17:14:29Z

cranelift/codegen/src/prelude_opt.isle

+(decl eq (Type Value Value) Value)
+(extractor (eq ty x y) (icmp ty (IntCC.Equal) x y))
+
+(decl ne (Type Value Value) Value)
+(extractor (ne ty x y) (icmp ty (IntCC.NotEqual) x y))
+
+(decl ult (Type Value Value) Value)
+(extractor (ult ty x y) (icmp ty (IntCC.UnsignedLessThan) x y))
+
+(decl ule (Type Value Value) Value)
+(extractor (ule ty x y) (icmp ty (IntCC.UnsignedLessThanOrEqual) x y))
+
+(decl ugt (Type Value Value) Value)
+(extractor (ugt ty x y) (icmp ty (IntCC.UnsignedGreaterThan) x y))
+
+(decl uge (Type Value Value) Value)
+(extractor (uge ty x y) (icmp ty (IntCC.UnsignedGreaterThanOrEqual) x y))
+
+(decl slt (Type Value Value) Value)
+(extractor (slt ty x y) (icmp ty (IntCC.SignedLessThan) x y))
+
+(decl sle (Type Value Value) Value)
+(extractor (sle ty x y) (icmp ty (IntCC.SignedLessThanOrEqual) x y))
+
+(decl sgt (Type Value Value) Value)
+(extractor (sgt ty x y) (icmp ty (IntCC.SignedGreaterThan) x y))
+
+(decl sge (Type Value Value) Value)
+(extractor (sge ty x y) (icmp ty (IntCC.SignedGreaterThanOrEqual) x y))


Can we put this in the common prelude, rather than duplicating it in both the opt and lower preludes?

It was originally, but the problem is that they're slightly different where the helpers for egraphs have a ty variable and the helpers for backends don't. I'm not sure how to otherwise unify those two myself

FWIW, we want to add the type parameters in the lowering prelude's versions of the bindings eventually, for exactly this reason -- the only reason they aren't there is that lowering rules were written first and we didn't realize we would need them on the constructor side. @mmcloughlin apparently has a private branch where this change has been made -- Michael, would you be willing to think about upstreaming this?

crates/cranelift/src/func_environ.rs

fitzgen · 2025-02-04T17:21:08Z

tests/disas/pulley/memory-inbounds.wat

All the instruction deltas in this file are equal or worse compared to before -- is that expected?

That's a good point, and one where I mostly shrugged this off but shouldn't have! I focused a bit more on this and improved a few cases in 63e2c5a

The remaining ones that got worse are:

offset_just_bad_v2 - looking into this it looks like optimizations shuffle constants around enough that the output CLIF is quite different from normal loads/stores, so I think this is something we'll basically just take a hit on for now and improve later if necessary

maybe_inbounds_v2 - this doesn't codegen well because egraphs can't optimize away the uadd_overflow_trap-with-iconst-inputs that the frontend generates. Another possible fix is to optimize the frontend to skip uadd_overflow_trap in this case (constant address).

fitzgen · 2025-02-04T17:23:07Z

And sorry for the delay on this review, I forgot about it for a little bit

alexcrichton requested review from a team as code owners January 29, 2025 23:55

alexcrichton requested review from abrown, pchickey and fitzgen and removed request for a team, abrown and pchickey January 29, 2025 23:55

alexcrichton mentioned this pull request Jan 30, 2025

Cranelift: define and use monolithic dynamic-Wasm-bounds-check operator in CLIF. #10150

Open

fitzgen approved these changes Feb 4, 2025

View reviewed changes

alexcrichton added 4 commits February 6, 2025 06:19

Use new is_pulley helper more

1bda21d

Improve addrz helper, tighten up memory-inbounds.wat a bit

725d03e

Improve codegen in a few memory-inbounds.wat cases

63e2c5a

Fix test expectation

d054534

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pulley: Reimplement wasm loads/stores & memory opcodes #10154

pulley: Reimplement wasm loads/stores & memory opcodes #10154

alexcrichton commented Jan 29, 2025

alexcrichton commented Jan 29, 2025

github-actions bot commented Jan 30, 2025

fitzgen left a comment

fitzgen Feb 4, 2025

alexcrichton Feb 6, 2025

cfallin Feb 6, 2025

fitzgen Feb 4, 2025

alexcrichton Feb 6, 2025

fitzgen commented Feb 4, 2025

pulley: Reimplement wasm loads/stores & memory opcodes #10154

Are you sure you want to change the base?

pulley: Reimplement wasm loads/stores & memory opcodes #10154

Conversation

alexcrichton commented Jan 29, 2025

alexcrichton commented Jan 29, 2025

github-actions bot commented Jan 30, 2025

Subscribe to Label Action

fitzgen left a comment

Choose a reason for hiding this comment

fitzgen Feb 4, 2025

Choose a reason for hiding this comment

alexcrichton Feb 6, 2025

Choose a reason for hiding this comment

cfallin Feb 6, 2025

Choose a reason for hiding this comment

fitzgen Feb 4, 2025

Choose a reason for hiding this comment

alexcrichton Feb 6, 2025

Choose a reason for hiding this comment

fitzgen commented Feb 4, 2025