forked from triton-lang/triton
Streamkv0.3 #659
Closed
Conversation
…triton-lang#4951) This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion.
…odegen bug (triton-lang#4873)" (triton-lang#4973) After investigation of the differences caused by triton-lang#4774 in the internal tests, we concluded that they were introduced by change in the layouts selected for the reduce operations. Re-introducing that change, as it is functionally correct and should be beneficial for performance.
This commit adds initial support for scaled_dot with mxfp8 LHS and fp8 RHS. It supports both the mfma32 and mfma16 intrinsic variants. Right now we are missing software emulation for the `Float8E4M3FN` type, so this is only enabled for `Float8E5M2`.
…`interpreter.cc` (triton-lang#4976) `#include <atomic>` is already used in other Triton files, so I believe this is not a fundamental change. Changes come from triton-lang#4045
This commit removes the special cases for MFMA -> Dot Operand LDS shortcuts. They are now supported by the common linear layout infrastructure. No tests are added; mfma-shortcut.mlir already tests this.
`scaled_dot` is not yet implemented on `gfx11` and `gfx12` so disable unit tests for now.
The string representation allows PyTorch Inductor to serialize/deserialize the `AttrsDescriptor` to the `@triton.heuristics` block in the generated code.
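As a rough illustration of the serialize/deserialize idea (the class below is a hypothetical stand-in, not the actual `AttrsDescriptor` API), a `repr()` string that can be `eval`'d back is all the generated code needs:
```python
# Hypothetical stand-in for AttrsDescriptor; field names are illustrative only.
from dataclasses import dataclass

@dataclass
class DummyAttrs:
    divisible_by_16: tuple
    equal_to_1: tuple

attrs = DummyAttrs(divisible_by_16=(0, 1), equal_to_1=(3,))
text = repr(attrs)          # string form emitted into the generated code
restored = eval(text)       # evaluated back when the generated code runs
assert restored == attrs
```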
…-lang#5009) Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003
…indows (triton-lang#5014) The `-A` argument is not compatible with the Ninja generator. Signed-off-by: Anatoly Myachev <[email protected]>
…lang#4996) In passing, we also improve a few other things:
- `scaled_dot` now accepts both uint8/uint16 fp8/bf16 as inputs (before you had to cast to uint8, which was weird when extending it to bf16).
- Add `scaled_dot` to the docs and improve the docs overall (have not rendered them; they might need a few further tweaks).
…Py HALF-related code (triton-lang#5010) Closes triton-lang#4992
Example:
```python
# On Windows
>>> sysconfig.get_config_var("EXE")
'.exe'
# On Linux
>>> sysconfig.get_config_var("EXE")
''
```
---------
Signed-off-by: Anatoly Myachev <[email protected]>
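A small usage sketch of how such a suffix is typically consumed (the binary name here is just an illustration):
```python
import sysconfig

exe_suffix = sysconfig.get_config_var("EXE") or ""
binary_name = "ptxas" + exe_suffix   # "ptxas.exe" on Windows, "ptxas" elsewhere
```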
…4935) This PR adds more restrictions on when we should apply the sched-load optimization, and un-reverts triton-lang#4823. We will only apply the optimization when all of the following are satisfied (see the sketch after this list):
1. pure matmul problem, i.e. one `tt.dot` in the main loop
2. two `tt.load`s in the main loop
3. the 2nd `tt.load` is ahead of the `tt.dot`
4. the 1st user of the 2nd `tt.load` is after the `tt.dot`
5. the tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64
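A hedged sketch of the gating logic as a pure predicate over already-extracted loop facts; every name below is hypothetical and only mirrors the five conditions above, not the actual pass implementation:
```python
def should_apply_sched_load_opt(num_dots: int, num_loads: int,
                                second_load_pos: int, dot_pos: int,
                                first_use_of_second_load_pos: int,
                                non_k_dim: int, k_dim: int) -> bool:
    return (num_dots == 1                               # 1. pure matmul problem
            and num_loads == 2                          # 2. two tt.loads in the main loop
            and second_load_pos < dot_pos               # 3. 2nd tt.load is ahead of the tt.dot
            and first_use_of_second_load_pos > dot_pos  # 4. its first user is after the tt.dot
            and non_k_dim >= 128 and k_dim >= 64)       # 5. tile size is large enough
```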
…eduling (triton-lang#4814) We call assignMemoryLayouts and set up distToUse right before lowering. During scheduling, we call a helper function that checks if a load can be pipelined. loadOpToIndLevelAndUse is used during scheduling and also in assignMemoryLayouts. The next step is to clean up that logic so lowering will not use loadOpToIndLevelAndUse; instead it will check the schedule and find out which loads are pipelined (i.e., uses of the loads are in different stages from the loads).
1. Remove unnecessary header files
2. Remove unused `getCvtOrder`, since the dot operand now has its order defined
3. Remove unnecessary forward declarations
This commit relands triton-lang#4819 with the following fixes:
* Changed to a better way to mark opIdx for loads
* Replaced the template-based `rewindUnaryOps` with regular for-loops. The new way is more robust and can handle other unary ops automatically.
* Replaced `instr.sched.barriers` with the ones from the upstream MLIR `rocdl` dialect
* Extended lit tests
…4955) Continuation of the work from @lezcano triton-lang#4698 > With this PR, we make `%` always follow C semantics, similar to `//`. We update the type promotion docs fixing some inaccuracies. It is still not entirely precise though. For a discussion of the current semantics see triton-lang#4697 Pretty sure all that was left were changes for the frem function to emit `np.fmod` instead of `np.remainder` and to ignore ('uint16', 'float64') mod computations in the tests. I believe this combination is ill-conditioned but I could be wrong about that. Co-authored-by: lezcano <[email protected]>
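For reference, the difference between the two NumPy functions mentioned above: `np.fmod` follows C semantics (the result takes the sign of the dividend), while `np.remainder` follows Python's floored semantics (the sign of the divisor).
```python
import numpy as np

print(np.fmod(-7, 3))       # -1 -> C-style %, which Triton's % now matches
print(np.remainder(-7, 3))  #  2 -> Python-style %
```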
…-lang#5035) Add a barrier to avoid a race condition in case an assert is followed by an op that may trap if the assert condition is true. Since the tensors in those two operations may have different layouts, we need to make sure all the threads are done executing the assert before going on to the next op.
This commit adds support for an mxfp4-typed A tensor for scaled dot in the AMD backend. We moved the `convertMxfp4x2ToBf16x2` impl from the NVIDIA side to a common path for reuse.
Support asserts with a scalar condition and only emit a barrier for asserts on tensors. Thanks to @peterbell10 for the suggestion.
…e` (triton-lang#5036) The new code structure is shorter and also allows getting rid of one level of code nesting in most places. As a side effect, it makes the code more Windows-friendly. For example, it eliminates situations where an attempt is made to open a file for reading while a file with the same name is already open for writing: https://github.com/intel/intel-xpu-backend-for-triton/pull/2478/files#r1805224201 (which doesn't work on Windows). pytest docs: https://docs.pytest.org/en/stable/how-to/tmp_path.html
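For context, a minimal example of the `tmp_path` fixture in action: each test receives its own fresh temporary directory as a `pathlib.Path`, so tests never contend for the same file name.
```python
def test_writes_are_isolated(tmp_path):
    f = tmp_path / "kernel.ttgir"      # hypothetical file name, for illustration
    f.write_text("some IR")
    assert f.read_text() == "some IR"
```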
…able thermal state (triton-lang#5042) Following NVIDIA's recipe for measuring performance in the 09-persistent-matmul.py tutorial: get the system into a stable thermal state with a long warmup run, then do 1000 runs of the benchmark. We couldn't do this before because creating and passing TMA descriptors created a GPU bubble that let the GPU cool down, so it never reached equilibrium, skewing TMA kernel results towards unfairly high scores. With the changes around passing descriptors via grid constants, I see results very close to the version with descriptor re-use, so we can now use this methodology and get correct benchmarking results. Example command line for measuring perf of fp8 matmul across K=[512, 8192]: `python 09-persistent-matmul.py --prec fp8 --K_range 512 8192`
…triton-lang#5031) The CombineTensorSelectAndIf pass currently doesn't work correctly **when the user of the select is inside the scf.if block**. For example:
```mlir
%select = arith.select %cond, %trueVal, %falseVal : i32
%if = scf.if %cond -> (i32) {
  %sub = arith.subi %select, %val1 : i32
  scf.yield %sub : i32
} else {
  %mul = arith.muli %select, %val2 : i32
  scf.yield %mul : i32
}
use %select
```
In this case, dom.dominates(ifOp, user) will return true, but directly using replaceAllUsesWith would lead to incorrect replacement behavior.
```mlir
// without this pr (the user in ifOp uses the result of ifOp)
%if:2 = scf.if %cond -> (i32, i32) {
  %sub = arith.subi %if#1, %val1 : i32
  scf.yield %sub, %trueVal : i32, i32
} else {
  %mul = arith.muli %if#1, %val2 : i32
  scf.yield %mul, %falseVal : i32, i32
}
use %if#1
```
To address this, we need to adjust the user's operand based on the specific region it is in.
```mlir
// with this pr (the user in ifOp is canonicalized first)
%if:2 = scf.if %cond -> (i32, i32) {
  %sub = arith.subi %trueVal, %val1 : i32
  scf.yield %sub, %trueVal : i32, i32
} else {
  %mul = arith.muli %falseVal, %val2 : i32
  scf.yield %mul, %falseVal : i32, i32
}
use %if#1
```
…-lang#5041) It's not entirely clear to me whether the previous logic was equivalent to this one, as it was rather obtuse. I think the new one is optimal but I'm happy to run benchmarks to make sure we don't regress.
Two bugfixes following triton-lang#5009.
- When `BLOCK_M=64` and `num_warps > 4`, the order of warps for a DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension.
- We should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase.

@lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they would normally be caught by `test_core.py`, but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against PyTorch.
…-lang#5051) Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Fold an fp_to_fp op with a zero constant input into a zero constant of the fp_to_fp op's destination type.
---------
Co-authored-by: Ognjen Plavsic <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
This commit enables the pipeliner test for scaled dot on the AMD backend. Along the way, it unifies some target/arch probe utilities into the common `_internal_testing` file.
triton-lang#5064) Bumping llvm to include a loop unroller fix: llvm/llvm-project#114573. This is needed for subsequent loop unroller upstreaming work.
…triton-lang#5027) The epilog loop created by the loop unroller may not run if the main unrolled loop covers all the original loop iterations, so pipelining it non-speculatively may not be beneficial. It can also cause correctness issues when combined with the downstream PTXAS optimizer.
…riton-lang#5055) We use `getOrder` very liberally throughout the codebase when we really mean `getThreadOrder`. This is an issue when the input layout is a `DotOperand(mma(opIdx=1))`, where the thread order and the matrix order are opposite. Found this to be an issue when a PR changed the `getOrder` of `DotOperand(Hopper)` to an incorrect one and CI still passed! The issue here is that the LLVM lowering for wgmma and the LinearLayout do not use `getOrder`, but many other subsystems do, so many heuristics would get an incorrect order and potentially be disabled. This is particularly problematic for `DotOperand(opIdx=1)` on NVIDIA hardware, as `getThreadOrder` and `getOrder` are different! While doing so we:
- Audit most (all?) of the calls to `getOrder(dotOperand)`. It turns out that most of them really meant `getThreadOrder`
- Fix the ordering methods of `SliceEncodingAttr` to be consistent
- Move the implementation of `getWarpOrder` to the Attr classes, because of OOP

The test strategy was to add `llvm::report_fatal_error("Testing");` within `getOrder(nvidiaMma)` and `getOrder(DotOperand(nvidiaMma))` and triage all errors that were raised in CI.
There were a number of complex regexes used for parsing MLIR in the python frontend. For maintainability reasons, it is likely better to just expose the MLIR bindings to python and use those instead. The PTX regex is left in place because we don't have an easy way to parse PTX (for now).
…on (triton-lang#5033) Reductions have special handling for side-effectful "combine ops" (e.g. "add" for a sum reduction). In the presence of side effects, a predicate is computed to determine whether a thread should participate in the reduction, to ensure that invalid/uninitialized data is not operated on. See triton-lang#4811 for more details. ~Previously, the predicate logic was incorrect for 2D reductions. This PR fixes the logic and adds a python test.~ Edit: after additional discussion with @peterbell10, we removed the lanePred logic. Here's our thinking on why this is valid:
* lanePred info is computed based entirely on the blocked layout info and properties of the reduction
* the blocked layout won't tell you which threads do or don't have uninitialized data

Instead, it sounds like the motivation for triton-lang#4811 is based on uninitialized values that can be indicated by the `pred` variable passed into `warpReduce()`.
Adding AMD support to convert fp8E4M3FN (aka fp8e4nv) to bf16. This will allow fp8E4M3FN to be emulated by bf16 to support scaled_dot(). Previously fp8E4M3FN wasn't supported at all, and this PR adds only a single conversion to bf16.
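For a host-side sanity check of the numerics (this assumes the `ml_dtypes` package, which is not part of this PR), every fp8E4M3FN value is exactly representable in bf16, so the conversion itself is lossless:
```python
import numpy as np
import ml_dtypes  # assumed for illustration; provides fp8/bf16 NumPy dtypes

x = np.array([0.5, 1.5, 240.0], dtype=ml_dtypes.float8_e4m3fn)
y = x.astype(ml_dtypes.bfloat16)       # exact: fp8e4m3fn values fit within bf16
print(y)                               # [0.5 1.5 240.0]
```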
Signed-off-by: Anatoly Myachev <[email protected]> Co-authored-by: peterbell10 <[email protected]>
…riton-lang#5075) This is a follow up to triton-lang#5033 but for scan ops, and also improving the testing as it was clearly insufficient before.
It's a bit annoying to have to remember to pass `TRITON_BUILD_WITH_CCACHE` every time you invoke `pip install`. This changes the default to check if ccache is installed and use it if it's found, with the option to manually disable ccache if it's not wanted for some reason.
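A minimal sketch of the new default, assuming a helper along these lines inside the build script (names are illustrative): use ccache automatically when it is on PATH, unless it is explicitly set via the existing `TRITON_BUILD_WITH_CCACHE` variable.
```python
import os
import shutil

def use_ccache() -> bool:
    flag = os.environ.get("TRITON_BUILD_WITH_CCACHE")
    if flag is not None:                        # an explicit setting always wins
        return flag.lower() in ("1", "true", "on", "yes")
    return shutil.which("ccache") is not None   # otherwise use ccache if installed
```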
These two tests cause a significant CI runtime increase (10x), so we are disabling them to unblock while figuring out the issues.
This pulls in some new symbols so we don't need to use hardcoded numbers later. This cherry-picks triton-lang#3989 again to fix macOS builds.
…-lang#5074) triton-lang#5062 enabled upcasting fp8E4M3FN to bf16, so now we can support that variant too.
…le (triton-lang#5079) ExternElementwise ops have a `pure` attribute that marks the op as pure. If an op is pure, it should also be speculatable. In the reduction/scan ttgir->llvm passes, checks for speculatability are failing for ExternElementwise ops, causing additional conditional handling to be added. This PR makes ExternElementwise ops implement ConditionallySpeculatable and marks the op as speculatable if it is marked as pure. This removes the conditional branches from the generated scan/reduction code.
A couple of places were not handling slice layout inputs for reductions. Add support for recursive slice layouts in those cases.
…riton-lang#5081) Context: in `CodeGenerator.__init__`, globals for a given triton function are modified to handle remapping the libdevice module to cuda or hip (from triton-lang#4539). In particular, this logic:
```python
for k, v in gscope.items():  # gscope is a dict of fn.__globals__
    ...
    self.gscope[k] = getattr(module_map[module_name], k)
```
was failing if you do this in the global scope: `from triton.language.extras.libdevice import fast_dividef as my_fast_dividef`.
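A standalone illustration of the failure mode (all names below are stand-ins for illustration): with an aliased import, the global name no longer matches any attribute of the target libdevice module, so the `getattr` above raises `AttributeError`.
```python
import types

# Stand-in for the cuda/hip libdevice module being remapped to.
libdevice_cuda = types.SimpleNamespace(fast_dividef=lambda a, b: a / b)

def fast_dividef(a, b):                      # stand-in for the generic libdevice symbol
    return a / b

gscope = {"my_fast_dividef": fast_dividef}   # imported with an alias

# The remapping looks up the aliased global name, not the symbol's own name,
# so the attribute is missing on the target module.
print(hasattr(libdevice_cuda, "my_fast_dividef"))  # False
```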
- Removed functions related to unpacking and packing I32 values.
- Updated utilities to handle conversion of mxfp4 values without packing/unpacking I32.
- Move the register value ordering logic from the element-wise operation lowering to the dot operation lowering.
- Use linear layout to handle conversions between almost all distributed layouts.
- Clean up data loading and mma computation involving `repN`, `repK`, and `repM`.
Something is wrong; let's close this PR.
streamk gemm script v0.3