A noalias related numerical error #250

Closed
newling opened this issue Jan 7, 2025 · 1 comment

newling commented Jan 7, 2025

Background information (skippable)

We (iree-amd-aie) compile matmuls, batch matmuls, and matmul + elementwise 'kernels' of different sizes through peano. We sometimes run out of program memory. To avoid this, we recently implemented an MLIR function outlining pass: duplicated blocks of code are replaced with one-line calls to the outlined function. This works very well for reducing the amount of program memory we use, but results in severe performance degradation (slowdowns of 2x or more).

This degradation is a bit surprising: when we use microkernels we see good performance, and microkernels use the same LLVM function calls as the outlined functions do (tail call void @something). This suggests that there is some missed optimization when we outline, or that something in the way we construct our functions is causing the slowdown.

The function signatures at the LLVM level (.ll file) that our outlined functions are lowered to look something like

define void @generic_matmul_0_outlined(ptr %0, ptr %1, ptr %2) {
   ; outlined common code for matmul (for example with m=n=k=32).
}

The above is for the matmul C = A@B, where %0 and %1 are pointers into A and B, and %2 is a pointer into C. Running peano's opt (we use -O2 and a few other flags) transforms this further to

define void @generic_matmul_0_outlined(ptr nocapture readonly %0, ptr nocapture readonly %1, ptr nocapture %2) {
   ; optimized (unrolled, etc) matmul.
}

Note that the above function signature contains no information about alignment and no aliasing information. I assume that opt couldn't deduce that C does not alias A and B. Could this be the missing information that's causing the slowdown? In practice we know that %2 above definitely does not alias %0 or %1. So I've tried manually (in an MLIR pass) adding the noalias attribute to the function signature, resulting in

define void @generic_matmul_0_outlined(ptr nocapture readonly %0, ptr nocapture readonly %1, ptr noalias nocapture %2) {
   ...
}

and this fixes the problem: with the noalias attribute added, the performance with and without outlining is basically the same.
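
To make concrete what the noalias attribute promises the optimizer, here is a minimal Python sketch (not LLVM or peano code). It models memory as a flat list and compares a conservative dot product, which reads and writes the output location on every step, with a version that keeps the accumulator in a local variable and stores once, a reordering that is only valid if the output does not overlap the inputs. The function names and the toy memory layout are hypothetical and chosen purely for illustration.

```python
def dot_conservative(mem, a, b, c, n):
    """Accumulate through memory on every step, as an optimizer must do
    when the store to mem[c] might alias the loads from a or b."""
    mem[c] = 0
    for i in range(n):
        mem[c] = mem[c] + mem[a + i] * mem[b + i]


def dot_assuming_noalias(mem, a, b, c, n):
    """Keep the accumulator in a 'register' and store once at the end.
    This is only equivalent to the version above if c overlaps neither input."""
    acc = 0
    for i in range(n):
        acc += mem[a + i] * mem[b + i]
    mem[c] = acc


def run(fn, c):
    # Toy layout: A occupies slots 0..2, B occupies slots 3..5, slot 6 is spare.
    mem = [1, 2, 3, 4, 5, 6, 0]
    fn(mem, 0, 3, c, 3)
    return mem[c]


if __name__ == "__main__":
    print(run(dot_conservative, 6), run(dot_assuming_noalias, 6))  # disjoint output
    print(run(dot_conservative, 1), run(dot_assuming_noalias, 1))  # output overlaps A
```

With a disjoint output slot both versions agree; when the output overlaps A they diverge. This only illustrates the contract that noalias expresses. It says nothing about the cause of the error reported below, where the output is known not to alias the inputs.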

Issue

For some 'unusual' matmuls and batch matmuls, adding the noalias attribute results in numerical errors. I can't see anything wrong with the IR (.ll or .opt.ll files). We only observe the numerical error when opt is run with -O1, -O2, or -O3: at -O0 there is no numerical error. The error at -O1 differs from the error at -O2/-O3 (which is the same at both levels). None of the shapes we're currently interested in have numerical issues, but obviously we want to be able to add noalias for all shapes (if this approach is sensible). One shape for which I see the failure is a matmul with M=N=K=32 (i.e. A, B, and C are all 32x32 matrices).

I have attached the following files to help triangulate the problem:

| File | Notes |
| --- | --- |
| input.ll | The original IR for the function |
| input_no_alias.ll | Above, but with the noalias attribute added to final operand (C) |
| input.opt0.ll | The IR after running opt -O0 on input.ll. Numerically correct. |
| input.opt1.ll | The IR after running opt -O1 on input.ll. Numerically correct. |
| input.opt2.ll | The IR after running opt -O2 on input.ll. Numerically correct. |
| input_no_alias.opt0.ll | The IR after running opt -O0 on input_no_alias.ll. Numerically correct. |
| input_no_alias.opt1.ll | The IR after running opt -O1 on input_no_alias.ll. Numerically incorrect. |
| input_no_alias.opt2.ll | The IR after running opt -O2 on input_no_alias.ll. Numerically incorrect. |
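
For anyone wanting to regenerate the files in the table above, the following Python sketch builds one opt invocation per source file and optimization level. It assumes that the `opt` on the PATH is the one shipped with peano and that a plain `-O<n>` plus `-S` (textual IR output) is enough. The "few other flags" mentioned earlier are not listed in this issue and are not reproduced here. The helper name `opt_commands` is hypothetical.

```python
from pathlib import Path


def opt_commands(sources=("input.ll", "input_no_alias.ll"), levels=(0, 1, 2)):
    """Build one `opt` command line per (source file, optimization level)."""
    commands = []
    for src in sources:
        for level in levels:
            # input.ll -> input.opt0.ll, input.opt1.ll, ...
            out = Path(src).with_suffix(f".opt{level}.ll")
            # -S asks opt to emit textual IR rather than bitcode.
            commands.append(["opt", f"-O{level}", "-S", src, "-o", str(out)])
    return commands


if __name__ == "__main__":
    for cmd in opt_commands():
        print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` to actually produce the files. The sketch only prints them, so it has no dependency on a particular toolchain being installed.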

input.ll.txt
input_noalias.ll.txt
input.opt0.ll.txt
input.opt1.ll.txt
input.opt2.ll.txt
input_noalias.opt0.ll.txt
input_noalias.opt1.ll.txt
input_noalias.opt2.ll.txt

All files as archives:
opt_files.tar.gz
opt_files.zip

Some observations

The difference between input.opt1.ll and input_no_alias.opt1.ll is only the function signature:

 define void @generic_matmul_0_outlined(ptr nocapture readonly %0, ptr nocapture readonly %1, ptr nocapture %2) local_unnamed_addr #0 {

vs

 define void @generic_matmul_0_outlined(ptr nocapture readonly %0, ptr nocapture readonly %1, ptr noalias nocapture %2) local_unnamed_addr #0 {

Recall that input_no_alias.opt1.ll gives the numerical error, while input.opt1.ll does not. Presumably this means that peano is using the noalias attribute after opt has run? (I'm not sure what peano does with the optimized LLVM IR; pointers to where in the code to look would be helpful.)
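
The signature-only difference at -O1 noted above can be checked mechanically by comparing the per-parameter attributes on the two define lines. The following Python sketch does this with deliberately naive parsing: it assumes simple attributes without parentheses (so something like `dereferenceable(N)` would not be handled), and `param_attrs` and `attr_diff` are hypothetical helper names.

```python
import re


def param_attrs(define_line):
    """Return one set of attribute tokens per parameter of an LLVM `define` line."""
    inside = re.search(r"\((.*?)\)", define_line).group(1)
    params = [p.split() for p in inside.split(",")]
    # Drop the type (first token) and the value name (last token, e.g. %0).
    return [set(tokens[1:-1]) for tokens in params]


def attr_diff(old_line, new_line):
    """Map parameter index -> attributes that appear only in new_line."""
    added = {}
    pairs = zip(param_attrs(old_line), param_attrs(new_line))
    for index, (old, new) in enumerate(pairs):
        if new - old:
            added[index] = new - old
    return added


BASE = ("define void @generic_matmul_0_outlined(ptr nocapture readonly %0, "
        "ptr nocapture readonly %1, ptr nocapture %2) local_unnamed_addr #0 {")
NOALIAS = ("define void @generic_matmul_0_outlined(ptr nocapture readonly %0, "
           "ptr nocapture readonly %1, ptr noalias nocapture %2) local_unnamed_addr #0 {")

if __name__ == "__main__":
    print(attr_diff(BASE, NOALIAS))
```

For the two signatures quoted above, the only reported difference is the noalias attribute on the third parameter (index 2). That matches the observation that the function bodies are identical at -O1.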

At O2, the difference between input.opt2.ll and input_no_alias.opt2.ll is larger: the body of the function with the noalias attribute is much shorter (this is presumably why the performance is better with the noalias attribute added at O2 and O3).

Questions

  • Is there anything obviously wrong with the IR, input_no_alias.ll?
  • Is there another, better way to manipulate the function signature of our outlined function, to bridge the performance gap with the inlined (non-outlined) version?
  • Where is the numerical error coming from?

newling commented Jan 9, 2025

Using a newer version of peano (aka llvm-aie) fixes this problem, i.e.
wheel from September 2024 : numerical error
wheel from January 2025 : no numerical error

newling closed this as completed Jan 9, 2025