We (iree-amd-aie) compile matmuls, batch matmuls, and matmul + elementwise 'kernels' of different sizes through peano. We sometimes run out of program memory. To avoid this, we recently implemented an MLIR function outlining pass: duplicated blocks of code are replaced with one-line calls to an outlined function. This works very well for reducing the amount of program memory we use, but results in severe performance degradation (slowdowns of 2x or more).
This degradation is a bit surprising, as when we use microkernels we see good performance, and microkernels use the same llvm function calls as the outlined functions do (tail call void @something). This suggests there is some missed optimization when we outline, or something in the way we are constructing our functions that is causing this slowdown.
The function signatures at the LLVM level (.ll file) that our outlined functions are lowered to look something like
define void @generic_matmul_0_outlined(ptr %0, ptr %1, ptr %2) {
; outlined common code for matmul (for example with m=n=k=32).
}
The above is for matmul C = A@B, where %0 and %1 are pointers into A and B, and %2 is a pointer into C. The llvm opt pass from peano (we use -O2 and a few other flags) lowers this further to
Note that the above function signature contains no information about alignment or aliasing. I assume that opt couldn't deduce that C does not alias A and B. Could this be the missing information that's causing the slowdown? In practice we know that %2 above definitely does not alias %0 or %1. So I've tried manually (in an MLIR pass) adding the noalias attribute to the function signature, resulting in
and this fixes the problem: with the noalias attribute added, the performance with and without outlining is basically the same.
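For readers who think in C rather than LLVM IR: a `restrict`-qualified pointer parameter is the source-level counterpart of `noalias`, and clang emits `noalias` on such parameters. The sketch below is a hypothetical stand-in for an outlined matmul (the name `matmul_outlined_sketch` and the row-major square layout are assumptions for illustration, not what iree-amd-aie generates). It only shows where the promise sits in the signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an outlined kernel: C = A @ B for row-major
 * n x n matrices. `restrict` on c is the caller's promise that memory
 * written through c is not accessed through a or b during the call. */
void matmul_outlined_sketch(const float *a, const float *b,
                            float *restrict c, size_t n)
{
    assert(a && b && c && n > 0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc; /* store the compiler may assume is independent of a and b */
        }
    }
}
```

If the promise is broken (for example, the caller passes the same buffer as both an input and the output), the behavior is undefined, so an attribute like this is only safe when every call site really passes non-overlapping buffers.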
Issue
For some 'unusual' matmuls and batch matmuls, adding the noalias attribute results in numerical errors. I can't see anything wrong with the IR (.ll or .opt.ll files). We only observe the numerical error when opt is run with -O1, -O2, or -O3: at -O0 there is no numerical error. The numerical error at -O1 differs from the one at -O2/-O3 (which is the same at O2 and O3). None of the shapes we're interested in have numerical issues, but obviously we want to be able to add noalias for all shapes (if this approach is sensible). One shape for which I see the failure is when the function does a matmul for M=N=K=32 (i.e. A, B, and C are all 32x32 matrices).
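One way to make "numerically correct" concrete is to compare a kernel's output against a plain reference matmul and report the largest absolute difference. The helpers below are a minimal sketch under assumptions: the names `matmul_reference` and `max_abs_diff` are hypothetical, the data is `float` in row-major layout, and the real test harness may use different types and tolerances.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward reference: C = A @ B with A (m x k), B (k x n), C (m x n),
 * all row-major. Intended to be compiled without aggressive optimization. */
void matmul_reference(const float *a, const float *b, float *c,
                      size_t m, size_t n, size_t k)
{
    assert(a && b && c);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* Largest element-wise absolute difference between two buffers. */
float max_abs_diff(const float *x, const float *y, size_t count)
{
    float worst = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        float d = x[i] - y[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

For the failing shape above, `count` would be 32 * 32, and a result would be flagged when `max_abs_diff` exceeds whatever tolerance the test uses.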
I have attached the following files to help triangulate the problem:

| File | Notes |
| --- | --- |
| input.ll | The original IR for the function |
| input_no_alias.ll | Above, but with the noalias attribute added to final operand (C) |
| input.opt0.ll | The IR after running opt -O0 on input.ll. Numerically correct. |
| input.opt1.ll | The IR after running opt -O1 on input.ll. Numerically correct. |
| input.opt2.ll | The IR after running opt -O2 on input.ll. Numerically correct. |
| input_no_alias.opt0.ll | The IR after running opt -O0 on input_no_alias.ll. Numerically correct. |
| input_no_alias.opt1.ll | The IR after running opt -O1 on input_no_alias.ll. Numerically incorrect. |
| input_no_alias.opt2.ll | The IR after running opt -O2 on input_no_alias.ll. Numerically incorrect. |
Recall that input_no_alias.opt1.ll gives the numerical error, while input.opt1.ll does not. Presumably this means that peano is using the noalias attribute after opt has run? (I'm not sure what peano does with the optimized LLVM IR; pointers to where in the code to look would be helpful.)
At O2, the difference between input.opt2.ll and input_no_alias.opt2.ll is larger: the body of the function with the noalias attribute is much shorter (this is presumably why performance is better with the noalias attribute added at O2 and O3).
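As a generic illustration of why aliasing information can shrink a function body (not taken from the attached IR), consider the two C loops below. The function names are hypothetical. In the first, the compiler has to assume that a store to `out[i]` might change `*scale`, so it generally reloads `*scale` every iteration and is more constrained in vectorizing. In the second, `restrict` removes that possibility.

```c
#include <assert.h>
#include <stddef.h>

/* No aliasing promise: out may overlap scale or in, so loads cannot be
 * freely hoisted above or reordered around the stores to out. */
void scale_may_alias(float *out, const float *in, const float *scale, size_t n)
{
    assert(out && in && scale);
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * *scale;
}

/* restrict on out: the caller promises that out overlaps nothing else read
 * here, so *scale can be loaded once and the loop vectorized more freely. */
void scale_no_alias(float *restrict out, const float *in, const float *scale,
                    size_t n)
{
    assert(out && in && scale);
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * *scale;
}
```

Both functions compute the same result when the buffers really are disjoint. They can differ only in generated code, or in behavior if the promise is violated.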
Questions
Is there anything obviously wrong with the IR, input_no_alias.ll?
Is there another, better way to manipulate the function signature of our outlined function, to bridge the performance gap with the inlined (non-outlined) version?
Where is the numerical error coming from?
Using a newer version of peano (aka llvm-aie) fixes this problem, i.e.
wheel from September 2024 : numerical error
wheel from January 2025 : no numerical error
Attached files:
input.ll.txt
input_noalias.ll.txt
input.opt0.ll.txt
input.opt1.ll.txt
input.opt2.ll.txt
input_noalias.opt0.ll.txt
input_noalias.opt1.ll.txt
input_noalias.opt2.ll.txt
All files as archives:
opt_files.tar.gz
opt_files.zip
Some observations
The difference between input.opt1.ll and input_no_alias.opt1.ll is only the function signature.