With the code from the matmul tutorial (commenting out the …), the output is of shape [128, 32]. For rows [0, 63], the results match exactly; starting from row 64, there is a systematic difference:
Wonder if anyone can reproduce? Thanks!
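Roughly the kind of comparison I mean, as a sketch (the inner dimension K, the dtype, and the `matmul` wrapper from the tutorial are assumptions here, not my exact script):

```python
import torch

# Sketch of the comparison (K and float32 are guesses; `matmul` would be
# the launcher function from the Triton matmul tutorial).
torch.manual_seed(0)
M, K, N = 128, 64, 32
a = torch.randn((M, K), device="cuda", dtype=torch.float32)
b = torch.randn((K, N), device="cuda", dtype=torch.float32)

torch_out = torch.matmul(a, b)
# triton_out = matmul(a, b)  # the Triton tutorial kernel
# Per-row max abs difference: rows [0, 63] agree, rows [64, 127] don't.
# print((triton_out - torch_out).abs().max(dim=1).values)
```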
Replies: 1 comment 1 reply
So, I repro'ed this, and it seemed like an oddly large difference to me, until I remembered that Triton is almost certainly using tf32, and that while torch may be using tf32 (if you call `torch.set_float32_matmul_precision("high")`), it may not be! In fact, for a problem this small, I observe `ampere_sgemm_*` kernels in the profile, which are fp32 (not tf32) kernels.

Comparing gemms for equality is hard :). I usually like an approach like this: https://twitter.com/bwasti/status/1621370782436687872 Basically, `(torch.randn(shape)+1.0)/k`.
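A minimal sketch of how I read that suggestion (assuming `k` is the inner/reduction dimension, and comparing with a tolerance and a max-difference report rather than exact equality):

```python
import torch

# Sketch only: K is assumed to be the reduction dimension, so the
# (randn + 1.0) / K inputs keep each dot product small and mostly positive,
# which limits cancellation when accumulation orders differ.
M, K, N = 128, 64, 32
a = (torch.randn(M, K, device="cuda") + 1.0) / K
b = (torch.randn(K, N, device="cuda") + 1.0) / K

ref = torch.matmul(a, b)
# out = matmul(a, b)  # the Triton tutorial kernel, if imported
# Report the worst-case difference instead of asserting bitwise equality:
# print((out - ref).abs().max().item())
# print(torch.allclose(out, ref, atol=1e-4, rtol=1e-3))
```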