Streamkv0.3 #659

Closed
wants to merge 61 commits into from

Conversation

xiaohuguo2023
Member

streamk gemm script v0.3

  • a new persistent GEMM kernel
  • a GEMM benchmark tool using a nearest-neighbour approach

Jokeren and others added 30 commits October 28, 2024 11:08
…triton-lang#4951)

This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion.
…odegen bug (triton-lang#4873)" (triton-lang#4973)

After investigating the differences caused by
triton-lang#4774 in the internal tests,
we concluded that they were introduced by a change in the layouts selected
for the reduce operations. Re-introducing that change, as it is
functionally correct and should be beneficial for performance.
This commit adds initial support for scaled_dot with
mxfp8 LHS and fp8 RHS. It supports both the mfma32
and mfma16 intrinsic variants.

Right now we are missing software emulation for the
`Float8E4M3FN` type, so this is only enabled for
`Float8E5M2`.
…`interpreter.cc` (triton-lang#4976)

`#include <atomic>` is already used in other Triton files, so I believe
it's not a fundamental change.

Changes come from triton-lang#4045
This commit removes the special cases for MFMA -> Dot Operand
LDS shortcuts. They are now supported by the common linear layout
infrastructure.

No tests are added; mfma-shortcut.mlir already covers this.
`scaled_dot` is not yet implemented on `gfx11` and `gfx12`,
so disable the unit tests for now.
The string representation allows PyTorch Inductor to
serialize/deserialize the `AttrsDescriptor` to the `@triton.heuristics`
block in the generated code.
…-lang#5009)

Allows for upcasting in DotOp encoding in RF.
This lowering path is not currently in use; pending
triton-lang#5003
…indows (triton-lang#5014)

The `-A` argument is not compatible with the Ninja generator.

Signed-off-by: Anatoly Myachev <[email protected]>
…lang#4996)

In passing we also improve a few other things:
- `scaled_dot` now accepts both uint8/uint16 fp8/bf16 as inputs (before
you had to cast to uint8, which was weird when extending it to bf16).
- Add `scaled_dot` to the docs and improve the docs overall (I have not
rendered them; they might need a few further tweaks).
Example:

```python
# On Windows
>>> sysconfig.get_config_var("EXE")
'.exe'

# On Linux
>>> sysconfig.get_config_var("EXE")
''
```

---------

Signed-off-by: Anatoly Myachev <[email protected]>
…4935)

This PR adds more restrictions on when we should apply
the sched-load optimizations and un-reverts
triton-lang#4823.

We will only apply the optimization when all of the following are
satisfied:
1. pureMatmulProblem, i.e. 1 `tt.dot` in the main loop
2. two `tt.load`s in the main loop
3. the 2nd `tt.load` is ahead of the `tt.dot`
4. the 1st user of the 2nd `tt.load` is after the `tt.dot`
5. the tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64
…eduling (triton-lang#4814)

We call assignMemoryLayouts and set up distToUse right before lowering.
During scheduling, we call a helper function that checks if a load can
be pipelined.

loadOpToIndLevelAndUse is used during scheduling and also in
assignMemoryLayouts. The next step is to clean up that logic so lowering
will not use loadOpToIndLevelAndUse; instead, it will check the schedule and
find out which loads are pipelined (i.e. the uses of a load are in different
stages from the load itself).
1. Remove unnecessary header files
2. Remove unused `getCvtOrder` since dot operand now has its order
defined
3. Remove unnecessary forward declarations
This commit relands triton-lang#4819
with the following fixes:

* Changed to a better way of marking opIdx for loads
* Replaced the template-based `rewindUnaryOps` with regular
  for-loops. The new approach is more robust and can handle other
  unary ops automatically.
* Replaced `instr.sched.barriers` with the ones from the
  `rocdl` dialect in upstream MLIR
* Extended lit tests
…4955)

Continuation of the work from @lezcano
triton-lang#4698

> With this PR, we make `%` always follow C semantics, similar to `//`.
We update the type promotion docs fixing some inaccuracies. It is still
not entirely precise though. For a discussion of the current semantics
see triton-lang#4697

Pretty sure all that was left were changes for the frem function to emit
`np.fmod` instead of `np.remainder` and to ignore ('uint16', 'float64')
mod computations in the tests. I believe this combination is
ill-conditioned but I could be wrong about that.
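For context, here is a minimal NumPy illustration of the distinction (not taken from this PR's tests): `np.fmod` follows C semantics, while `np.remainder` follows Python's floor-division semantics.

```python
import numpy as np

# C-style remainder: the result takes the sign of the dividend.
print(np.fmod(-7, 3))       # -1
# Floor-style remainder (Python `%` on ints): the sign of the divisor.
print(np.remainder(-7, 3))  # 2
```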

Co-authored-by: lezcano <[email protected]>
…-lang#5035)

Add a barrier to avoid a race condition in case an assert is followed by
an op that may trap if the assert condition is true. Since the tensors in
those two operations may have different layouts, we need to make sure all
the threads are done executing the assert before moving on to the next op.
This commit adds support for an mxfp4-typed A tensor
for scaled dot in the AMD backend.

We moved the `convertMxfp4x2ToBf16x2` impl
from the NVIDIA side to a common path for reuse.
Support asserts with a scalar condition and only emit a barrier for asserts
on tensors.
Thanks to @peterbell10 for the suggestion.
…e` (triton-lang#5036)

The new code structure is shorter and also gets rid of one
level of code nesting in most places.

As a side effect, it makes the code more Windows-friendly. For example,
it eliminates situations where an attempt is made to open a file for
reading while a file with the same name is already open for writing:
https://github.com/intel/intel-xpu-backend-for-triton/pull/2478/files#r1805224201
(which doesn't work on Windows).

Pytest docs: https://docs.pytest.org/en/stable/how-to/tmp_path.html
…able thermal state (triton-lang#5042)

Following NVIDIA's recipe for measuring performance in the
09-persistent-matmul.py tutorial: get the system into a stable thermal state
with a long warmup run, then do 1000 benchmark runs.
We couldn't do this before because creating and passing TMA
descriptors created a GPU bubble that allowed the GPU to cool down, so it
never reached equilibrium, skewing the TMA kernel results towards unfairly
high scores. With the changes around passing descriptors via grid
constants I see results very close to the version with descriptor
re-use, so we can now use this methodology and get correct benchmarking
results.

Example cmd line for measuring perf of fp8 matmul across K=[512, 8192]:
`python 09-persistent-matmul.py --prec fp8 --K_range 512 8192`
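As a rough sketch of the warmup-then-measure approach (not the tutorial's actual code; `run_matmul` is a hypothetical stand-in for the kernel launch), `triton.testing.do_bench` takes its warmup and measurement budgets in milliseconds:

```python
import triton.testing

# Hypothetical stand-in for the tutorial's kernel launch.
def run_matmul():
    ...

# Long warmup (ms) to reach a stable thermal state,
# then a large timed budget for the actual measurement.
ms = triton.testing.do_bench(run_matmul, warmup=10000, rep=1000)
print(f"mean runtime: {ms:.3f} ms")
```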
tfruan2000 and others added 28 commits November 3, 2024 08:35
…triton-lang#5031)

The CombineTensorSelectAndIf pass currently doesn’t work correctly
**when the user of select is inside the scf.if block**.

For example:

```mlir
%select = arith.select %cond, %trueVal, %falseVal : i32
%if = scf.if %cond -> (i32) {
  %sub = arith.subi %select, %val1 : i32
  scf.yield %sub : i32
} else {
  %mul = arith.muli %select, %val2 : i32
  scf.yield %mul : i32
}
use %select
```

In this case, `dom.dominates(ifOp, user)` will return true, but directly
using `replaceAllUsesWith` would lead to incorrect replacement behavior.

```mlir
// without this PR (the user in ifOp uses the result of ifOp)
%if:2 = scf.if %cond -> (i32, i32) {
  %sub = arith.subi %if#1, %val1 : i32
  scf.yield %sub, %trueVal : i32, i32
} else {
  %mul = arith.muli %if#1, %val2 : i32
  scf.yield %mul, %falseVal : i32, i32
}
use %if#1
```


To address this, we need to adjust the user’s operand based on the
specific region it is in.

```mlir
// with this PR (the user in ifOp is canonicalized first)
%if:2 = scf.if %cond -> (i32, i32) {
  %sub = arith.subi %trueVal, %val1 : i32
  scf.yield %sub, %trueVal : i32, i32
} else {
  %mul = arith.muli %falseVal, %val2 : i32
  scf.yield %mul, %falseVal : i32, i32
}
use %if#1
```
…-lang#5041)

It's not entirely clear to me whether the previous logic was equivalent
to this one, as it was rather obtuse. I think the new one is optimal but
I'm happy to run benchmarks to make sure we don't regress.
Two bugfixes following triton-lang#5009.

- When `BLOCK_M=64` and `num_warps > 4`, the order of warps for
DotOpEncoded tensor should be M-major instead of N-major, since WGMMA
expects the 4 warps in each warp group to be stacked along the M
dimension.
- Should use `mmaBitwidth` instead of `bitwidth` when calculating
`numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad
rebase.

@lezcano I encountered these bugs when attempting to locally test the
[DotOp hoisting PR](triton-lang#5003)
after rebasing (they normally would be caught by `test_core.py` but that
path was not yet enabled in the last PR). With these fixes added, I was
able to successfully validate against pytorch.
Fold fp_to_fp op with a zero constant input into a zero constant with
fp_to_fp op destination type.

---------

Co-authored-by: Ognjen Plavsic <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
This commit enables the pipeliner test for scaled dot
on the AMD backend.

Along the way, it unifies some target/arch probe
utilities into the common `_internal_testing` file.
triton-lang#5064)

Bumping llvm to include a loop unroller fix:
llvm/llvm-project#114573. This is needed for
subsequent loop unroller upstreaming work.
…triton-lang#5027)

The epilogue loop created by the loop unroller may not run if the main
unrolled loop covers all the original loop iterations, so pipelining it
non-speculatively may not be beneficial. It can also cause correctness
issues when combined with the downstream PTXAS optimizer.
…riton-lang#5055)

We use `getOrder` very liberally throughout the codebase, when we really
mean `getThreadOrder`. This is an issue when the input layout is a
`DotOperand(mma(opIdx=1))`, where the thread order and the matrix order
are opposite.

Found this to be an issue when a PR changed the `getOrder` of
`DotOperand(Hopper)` to an incorrect one and CI still passed! The issue
here is that the LLVM lowering for wgmma and the LinearLayout do not
use `getOrder`, but many other subsystems do, and many
heuristics would be getting an incorrect order and potentially be
disabled.

This is particularly problematic for `DotOperand(opIdx=1)` on NVIDIA
hardware, as `getThreadOrder` and `getOrder` are different!

While doing so we:
- Audit most (all?) the calls to `getOrder(dotOperand)`. It turns out
that most of them really meant `getThreadOrder`
- Fix the ordering methods of `SliceEncodingAttr` to be consistent
- Move the implementation of `getWarpOrder` to the Attr classes, because
of OOP

The test strategy was to add `llvm::report_fatal_error("Testing");`
within `getOrder(nvidiaMma)` and `getOrder(DotOperand(nvidiaMma))` and
triaging all errors that were raised in CI.
There were a number of complex regexes used for parsing MLIR in the
python frontend. For maintainability reasons, it is likely better to
just expose the MLIR bindings to python and use those instead.

The PTX regex is left in place because we don't have an easy way to
parse PTX (for now).
…on (triton-lang#5033)

Reductions have special handling for side effectful "combine ops" (e.g.
"add" for a sum reduction). In the presence of side effects, a predicate
is computed to determine whether a thread should participate in the
reduction, to ensure that invalid/uninitialized data is not operated on.
See triton-lang#4811 for more details.

~Previously, the predicate logic was incorrect for 2D reductions. This
PR fixes the logic and adds a python test.~

Edit: after additional discussion with @peterbell10, we removed the
lanePred logic. Here's our thinking on why this is valid:
* lanePred info is computed based entirely on the blocked layout info
and properties of the reduction
* the blocked layout won't tell you which threads do or don't have
uninitialized data

Instead, it sounds like the motivation for triton-lang#4811 is based on
uninitialized values that can be indicated by the `pred` variable passed
into `warpReduce()`.
Adding AMD support to convert fp8E4M3FN (aka fp8e4nv) to bf16. This will
allow fp8E4M3FN to be emulated by bf16 to support scaled_dot().
Previously fp8E4M3FN wasn't supported at all, and this PR adds only a
single conversion to bf16.
Signed-off-by: Anatoly Myachev <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
…riton-lang#5075)

This is a follow-up to triton-lang#5033 but for scan ops; it also improves the
testing, which was clearly insufficient before.
It's a bit annoying to have to remember to pass
`TRITON_BUILD_WITH_CCACHE` every time you invoke `pip install`.

This changes the default to check if ccache is installed and use it if
it's found, with the option to manually disable ccache if it's not
wanted for some reason.
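A minimal sketch of how such auto-detection might look (illustrative only; the function name and flag plumbing are assumptions, not the actual setup.py code):

```python
import shutil

def ccache_cmake_args(disable_ccache: bool = False) -> list[str]:
    """Return CMake flags that route compilation through ccache, if available."""
    ccache = shutil.which("ccache")
    if disable_ccache or ccache is None:
        return []
    return [
        f"-DCMAKE_C_COMPILER_LAUNCHER={ccache}",
        f"-DCMAKE_CXX_COMPILER_LAUNCHER={ccache}",
    ]
```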
These two tests cause a significant CI runtime increase (10x),
so disable them to unblock CI while the issues are figured out.
This pulls in some new symbols so we don't need to use
hardcoded numbers later.

This cherry-picks triton-lang#3989
again to fix macOS builds.
…-lang#5074)

triton-lang#5062 enabled
upcasting fp8E4M3FN to bf16, so now we can support
that variant too.
…le (triton-lang#5079)

ExternElementwise ops have a `pure` attribute that marks the op as pure.
If an op is pure, it should also be speculatable. In the reduction/scan
ttgir->llvm passes, checks for speculatability are failing for
ExternElementwise ops, causing additional conditional handling to be
added.

This PR makes ExternElementwise ops implement ConditionallySpeculatable
and marks the op as speculatable if it is marked as pure. This
removes the conditional branches from the generated scan/reduction code.
A couple of places were not handling slice layout inputs for reductions.
Add support for recursive slice layouts in those cases.
…riton-lang#5081)

Context: in `CodeGenerator.__init__`, globals for a given triton
function are modified to handle remapping the libdevice module to cuda
or hip (from triton-lang#4539). In
particular, this logic:

```python
for k, v in gscope.items():  # gscope is a dict of fn.__globals__
  ...
  self.gscope[k] = getattr(module_map[module_name], k)
```

was failing if you do this in the global scope: `from
triton.language.extras.libdevice import fast_dividef as
my_fast_dividef`.
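For illustration, a small self-contained sketch of the described failure mode (the module objects here are stand-ins, not Triton's actual internals): when the function is imported under an alias, the global key no longer matches the attribute name in the remapped module, so the `getattr` lookup fails.

```python
import types

# Stand-in for a remapped libdevice module exposing `fast_dividef`.
libdevice = types.SimpleNamespace(fast_dividef=lambda x, y: x / y)

# Global scope after `from ... import fast_dividef as my_fast_dividef`:
gscope = {"my_fast_dividef": libdevice.fast_dividef}

for k, v in gscope.items():
    try:
        getattr(libdevice, k)  # looks up "my_fast_dividef", which doesn't exist
    except AttributeError as e:
        print(f"remapping failed: {e}")
```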
- Removed functions related to unpacking and packing I32 values.
- Updated utilities to handle conversion of mxfp4 values without
packing/unpacking I32.
- Move the register value ordering logic from the element-wise operation
lowering to the dot operation lowering.
- Use linear layout to handle conversions between almost all distributed
layouts.
- Clean up data loading and mma computation involving `repN`, `repK`,
and `repM`.
@xiaohuguo2023
Member Author

Something went wrong; let's close this PR.
