Streamkv0.3 #659

Closed
wants to merge 61 commits into from

Conversation

xiaohuguo2023
Member

streamk gemm script v0.3

  • a new persistent GEMM kernel
  • a GEMM benchmark tool using a nearest-neighbour approach

Jokeren and others added 30 commits October 28, 2024 11:08
…triton-lang#4951)

This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion.
…odegen bug (triton-lang#4873)" (triton-lang#4973)

After investigating the differences caused by
triton-lang#4774 in the internal tests,
we concluded that they were introduced by a change in the layouts selected
for the reduce operations. Re-introducing that change, as it is
functionally correct and should be beneficial for performance.
This commit adds initial support for scaled_dot with
mxfp8 LHS and fp8 RHS. It supports both the mfma32
and mfma16 intrinsic variants.

Right now we are missing software emulation for the
`Float8E4M3FN` type, so this is only enabled for
`Float8E5M2`.
…`interpreter.cc` (triton-lang#4976)

`#include <atomic>` is already used in other Triton files, so I believe
it's not a fundamental change.

Changes come from triton-lang#4045
This commit removes the special cases for MFMA -> Dot Operand
LDS shortcuts. They are now supported by the common linear layout
infrastructure.

No tests are added; mfma-shortcut.mlir already covers this.
`scaled_dot` is not yet implemented on `gfx11` and `gfx12`,
so disable the unit tests for now.
The string representation allows PyTorch Inductor to
serialize/deserialize the `AttrsDescriptor` to the `@triton.heuristics`
block in the generated code.
…-lang#5009)

Allows for upcasting in DotOp encoding in RF.
This lowering path is not currently in use; pending
triton-lang#5003
…indows (triton-lang#5014)

The `-A` argument is not compatible with the Ninja generator.

Signed-off-by: Anatoly Myachev <[email protected]>
…lang#4996)

In passing we also improve a few other things:
- `scaled_dot` now accepts both uint8/uint16 fp8/bf16 as inputs (before
you had to cast to uint8, which was weird when extending it to bf16).
- Add `scaled_dot` to the docs and improve the docs overall (I have not
rendered them; they might need a few further tweaks).
Example:

```python
# On Windows
>>> sysconfig.get_config_var("EXE")
'.exe'

# On Linux
>>> sysconfig.get_config_var("EXE")
''
```

---------

Signed-off-by: Anatoly Myachev <[email protected]>
…4935)

This PR adds more restrictions on when we should apply
the sched-load optimizations and un-reverts
triton-lang#4823.

We will only apply the optimization when all of the following are
satisfied:
1. pureMatmulProblem, i.e. 1 `tt.dot` in the main loop
2. two `tt.load`s in the main loop
3. the 2nd `tt.load` is ahead of the `tt.dot`
4. the 1st user of the 2nd `tt.load` is after the `tt.dot`
5. the tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64
…eduling (triton-lang#4814)

We call assignMemoryLayouts and set up distToUse right before lowering.
During scheduling, we call a helper function that checks if a load can
be pipelined.

loadOpToIndLevelAndUse is used during scheduling and also in
assignMemoryLayouts. The next step is to clean up that logic so lowering
will not use loadOpToIndLevelAndUse; instead, it will check the schedule and
find out which loads are pipelined (i.e. the uses of a load are in different
stages from the load itself).
1. Remove unnecessary header files
2. Remove unused `getCvtOrder` since dot operand now has its order
defined
3. Remove unnecessary forward declarations
This commit relands triton-lang#4819
with the following fixes:

* Changed to a better way of marking opIdx for loads
* Replaced the template-based `rewindUnaryOps` with regular
  for-loops. The new approach is more robust and can handle other
  unary ops automatically.
* Replaced `instr.sched.barriers` with the ones from the
  `rocdl` dialect in upstream MLIR
* Extended lit tests
…4955)

Continuation of the work from @lezcano
triton-lang#4698

> With this PR, we make `%` always follow C semantics, similar to `//`.
We update the type promotion docs fixing some inaccuracies. It is still
not entirely precise though. For a discussion of the current semantics
see triton-lang#4697

Pretty sure all that was left were changes for the frem function to emit
`np.fmod` instead of `np.remainder` and to ignore ('uint16', 'float64')
mod computations in the tests. I believe this combination is
ill-conditioned but I could be wrong about that.
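For context, here is a minimal NumPy illustration of the distinction (not taken from this PR's tests): `np.fmod` follows C semantics, while `np.remainder` follows Python's floor-division semantics.

```python
import numpy as np

# C-style remainder: the result takes the sign of the dividend.
print(np.fmod(-7, 3))       # -1
# Floor-style remainder (Python `%` on ints): the sign of the divisor.
print(np.remainder(-7, 3))  # 2
```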

Co-authored-by: lezcano <[email protected]>
…-lang#5035)

Add a barrier to avoid a race condition in case an assert is followed by
an op that may trap if the assert condition is true. Since the tensors in
those two operations may have different layouts, we need to make sure all
the threads are done executing the assert before moving on to the next op.
This commit adds support for an mxfp4-typed A tensor
for scaled dot in the AMD backend.

We moved the `convertMxfp4x2ToBf16x2` impl
from the NVIDIA side to a common path for reuse.
Support asserts with a scalar condition and only emit a barrier for asserts
on tensors.
Thanks to @peterbell10 for the suggestion.
…e` (triton-lang#5036)

The new code structure is shorter and also gets rid of one
level of code nesting in most places.

As a side effect, it makes the code more Windows-friendly. For example,
it eliminates situations where an attempt is made to open a file for
reading while a file with the same name is already open for writing:
https://github.com/intel/intel-xpu-backend-for-triton/pull/2478/files#r1805224201
(which doesn't work on Windows).

Pytest docs: https://docs.pytest.org/en/stable/how-to/tmp_path.html
…able thermal state (triton-lang#5042)

Following NVIDIA's recipe for measuring performance in the
09-persistent-matmul.py tutorial: get the system into a stable thermal state
with a long warmup run, then do 1000 benchmark runs.
We couldn't do this before because creating and passing TMA
descriptors created a GPU bubble that allowed the GPU to cool down, so it
never reached equilibrium, skewing the TMA kernel results towards unfairly
high scores. With the changes around passing descriptors via grid
constants I see results very close to the version with descriptor
re-use, so we can now use this methodology and get correct benchmarking
results.

Example cmd line for measuring perf of fp8 matmul across K=[512, 8192]:
`python 09-persistent-matmul.py --prec fp8 --K_range 512 8192`
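As a rough sketch of the warmup-then-measure approach (not the tutorial's actual code; `run_matmul` is a hypothetical stand-in for the kernel launch), `triton.testing.do_bench` takes its warmup and measurement budgets in milliseconds:

```python
import triton.testing

# Hypothetical stand-in for the tutorial's kernel launch.
def run_matmul():
    ...

# Long warmup (ms) to reach a stable thermal state,
# then a large timed budget for the actual measurement.
ms = triton.testing.do_bench(run_matmul, warmup=10000, rep=1000)
print(f"mean runtime: {ms:.3f} ms")
```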
tfruan2000 and others added 28 commits November 3, 2024 08:35
…triton-lang#5031)

The CombineTensorSelectAndIf pass currently doesn’t work correctly
**when the user of select is inside the scf.if block**.

For example:

```mlir
%select = arith.select %cond, %trueVal, %falseVal : i32
%if = scf.if %cond -> (i32) {
  %sub = arith.subi %select, %val1 : i32
  scf.yield %sub : i32
} else {
  %mul = arith.muli %select, %val2 : i32
  scf.yield %mul : i32
}
use %select
```

In this case, `dom.dominates(ifOp, user)` will return true, but directly
using `replaceAllUsesWith` would lead to incorrect replacement behavior.

```mlir
// without this PR (the user in ifOp uses the result of ifOp)
%if:2 = scf.if %cond -> (i32, i32) {
  %sub = arith.subi %if#1, %val1 : i32
  scf.yield %sub, %trueVal : i32, i32
} else {
  %mul = arith.muli %if#1, %val2 : i32
  scf.yield %mul, %falseVal : i32, i32
}
use %if#1
```


To address this, we need to adjust the user’s operand based on the
specific region it is in.

```mlir
// with this PR (the user in ifOp is canonicalized first)
%if:2 = scf.if %cond -> (i32, i32) {
  %sub = arith.subi %trueVal, %val1 : i32
  scf.yield %sub, %trueVal : i32, i32
} else {
  %mul = arith.muli %falseVal, %val2 : i32
  scf.yield %mul, %falseVal : i32, i32
}
use %if#1
```
…-lang#5041)

It's not entirely clear to me whether the previous logic was equivalent
to this one, as it was rather obtuse. I think the new one is optimal but
I'm happy to run benchmarks to make sure we don't regress.
Two bugfixes following triton-lang#5009.

- When `BLOCK_M=64` and `num_warps > 4`, the order of warps for
DotOpEncoded tensor should be M-major instead of N-major, since WGMMA
expects the 4 warps in each warp group to be stacked along the M
dimension.
- Should use `mmaBitwidth` instead of `bitwidth` when calculating
`numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad
rebase.

@lezcano I encountered these bugs when attempting to locally test the
[DotOp hoisting PR](triton-lang#5003)
after rebasing (they normally would be caught by `test_core.py` but that
path was not yet enabled in the last PR). With these fixes added, I was
able to successfully validate against pytorch.
Fold fp_to_fp op with a zero constant input into a zero constant with
fp_to_fp op destination type.

---------

Co-authored-by: Ognjen Plavsic <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
This commit enables the pipeliner test for scaled dot
on the AMD backend.

Along the way, it unifies some target/arch probe
utilities into the common `_internal_testing` file.
triton-lang#5064)

Bumping llvm to include a loop unroller fix:
llvm/llvm-project#114573. This is needed for
subsequent loop unroller upstreaming work.
…triton-lang#5027)

The epilogue loop created by the loop unroller may not run if the main
unrolled loop covers all the original loop iterations, so pipelining it
non-speculatively may not be beneficial. It can also cause correctness
issues when combined with the downstream PTXAS optimizer.
…riton-lang#5055)

We use `getOrder` very liberally throughout the codebase, when we really
mean `getThreadOrder`. This is an issue when the input layout is a
`DotOperand(mma(opIdx=1))`, where the thread order and the matrix order
are opposite.

Found this to be an issue when a PR changed the `getOrder` of
`DotOperand(Hopper)` to an incorrect one and CI still passed! The issue
here is that the LLVM lowering for wgmma and the LinearLayout do not
use `getOrder`, but many other subsystems do, and many
heuristics would be getting an incorrect order and potentially be
disabled.

This is particularly problematic for `DotOperand(opIdx=1)` on NVIDIA
hardware, as `getThreadOrder` and `getOrder` are different!

While doing so we:
- Audit most (all?) the calls to `getOrder(dotOperand)`. It turns out
that most of them really meant `getThreadOrder`
- Fix the ordering methods of `SliceEncodingAttr` to be consistent
- Move the implementation of `getWarpOrder` to the Attr classes, because
of OOP

The test strategy was to add `llvm::report_fatal_error("Testing");`
within `getOrder(nvidiaMma)` and `getOrder(DotOperand(nvidiaMma))` and
triaging all errors that were raised in CI.
There were a number of complex regexes used for parsing MLIR in the
python frontend. For maintainability reasons, it is likely better to
just expose the MLIR bindings to python and use those instead.

The PTX regex is left in place because we don't have an easy way to
parse PTX (for now).
…on (triton-lang#5033)

Reductions have special handling for side effectful "combine ops" (e.g.
"add" for a sum reduction). In the presence of side effects, a predicate
is computed to determine whether a thread should participate in the
reduction, to ensure that invalid/uninitialized data is not operated on.
See triton-lang#4811 for more details.

~Previously, the predicate logic was incorrect for 2D reductions. This
PR fixes the logic and adds a python test.~

Edit: after additional discussion with @peterbell10, we removed the
lanePred logic. Here's our thinking on why this is valid:
* lanePred info is computed based entirely on the blocked layout info
and properties of the reduction
* the blocked layout won't tell you which threads do or don't have
uninitialized data

Instead, it sounds like the motivation for triton-lang#4811 is based on
uninitialized values that can be indicated by the `pred` variable passed
into `warpReduce()`.
Adding AMD support to convert fp8E4M3FN (aka fp8e4nv) to bf16. This will
allow fp8E4M3FN to be emulated by bf16 to support scaled_dot().
Previously fp8E4M3FN wasn't supported at all, and this PR adds only a
single conversion to bf16.
Signed-off-by: Anatoly Myachev <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
…riton-lang#5075)

This is a follow-up to triton-lang#5033 but for scan ops; it also improves the
testing, which was clearly insufficient before.
It's a bit annoying to have to remember to pass
`TRITON_BUILD_WITH_CCACHE` every time you invoke `pip install`.

This changes the default to check if ccache is installed and use it if
it's found, with the option to manually disable ccache if it's not
wanted for some reason.
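A minimal sketch of how such auto-detection might look (illustrative only; the function name and flag plumbing are assumptions, not the actual setup.py code):

```python
import shutil

def ccache_cmake_args(disable_ccache: bool = False) -> list[str]:
    """Return CMake flags that route compilation through ccache, if available."""
    ccache = shutil.which("ccache")
    if disable_ccache or ccache is None:
        return []
    return [
        f"-DCMAKE_C_COMPILER_LAUNCHER={ccache}",
        f"-DCMAKE_CXX_COMPILER_LAUNCHER={ccache}",
    ]
```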
These two tests cause a significant CI runtime increase (10x),
so disable them to unblock CI while the issues are figured out.
This pulls in some new symbols so we don't need to use
hardcoded numbers later.

This cherry-picks triton-lang#3989
again to fix macOS builds.
…-lang#5074)

triton-lang#5062 enabled
upcasting fp8E4M3FN to bf16, so now we can support
that variant too.
…le (triton-lang#5079)

ExternElementwise ops have a `pure` attribute that marks the op as pure.
If an op is pure, it should also be speculatable. In the reduction/scan
ttgir->llvm passes, checks for speculatability are failing for
ExternElementwise ops, causing additional conditional handling to be
added.

This PR makes ExternElementwise ops implement ConditionallySpeculatable
and marks the op as speculatable if it is marked as pure. This
removes the conditional branches from the generated scan/reduction code.
A couple of places were not handling slice layout inputs for reductions.
Add support for recursive slice layouts in those cases.
…riton-lang#5081)

Context: in `CodeGenerator.__init__`, globals for a given triton
function are modified to handle remapping the libdevice module to cuda
or hip (from triton-lang#4539). In
particular, this logic:

```python
for k, v in gscope.items():  # gscope is a dict of fn.__globals__
  ...
  self.gscope[k] = getattr(module_map[module_name], k)
```

was failing if you do this in the global scope: `from
triton.language.extras.libdevice import fast_dividef as
my_fast_dividef`.
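For illustration, a small self-contained sketch of the described failure mode (the module objects here are stand-ins, not Triton's actual internals): when the function is imported under an alias, the global key no longer matches the attribute name in the remapped module, so the `getattr` lookup fails.

```python
import types

# Stand-in for a remapped libdevice module exposing `fast_dividef`.
libdevice = types.SimpleNamespace(fast_dividef=lambda x, y: x / y)

# Global scope after `from ... import fast_dividef as my_fast_dividef`:
gscope = {"my_fast_dividef": libdevice.fast_dividef}

for k, v in gscope.items():
    try:
        getattr(libdevice, k)  # looks up "my_fast_dividef", which doesn't exist
    except AttributeError as e:
        print(f"remapping failed: {e}")
```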
- Removed functions related to unpacking and packing I32 values.
- Updated utilities to handle conversion of mxfp4 values without
packing/unpacking I32.
- Move the register value ordering logic from the element-wise operation
lowering to the dot operation lowering.
- Use linear layout to handle conversions between almost all distributed
layouts.
- Clean up data loading and mma computation involving `repN`, `repK`,
and `repM`.
@xiaohuguo2023
Member Author

Something went wrong; let's close this PR.
