
[Tutorial] Remove incorrect caching from softmax tutorial #5162

Open · wants to merge 1 commit into main

Conversation

Mogball (Collaborator) commented Nov 15, 2024

The fused softmax implementation in the tutorial precompiles the kernel in order to query its register and shared-memory usage for the parameters used to specialize it. On top of this, it implements a simple caching system for that precompilation step, keyed only on the block size.

As noted in #4739, this caching is incorrect: the cache key does not include the `num_stages` constexpr argument or the shapes of the tensors. Since Triton already has its own JIT compilation cache, and this hand-rolled cache is not really relevant to the tutorial, just remove it to get rid of the footgun.
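
To make the footgun concrete, here is a toy sketch (hypothetical, not the tutorial's code) of how a cache keyed only on the block size goes stale: a later call with the same BLOCK_SIZE but a different num_stages or different tensor shapes silently reuses the first entry.

# Hypothetical, simplified illustration of the stale cache described above.
# Only BLOCK_SIZE is used as the key, so everything else that affects the
# compiled kernel (num_stages, tensor shapes/strides) is ignored on reuse.
kernels = {}

def precompile(BLOCK_SIZE, num_stages, shape):
    if BLOCK_SIZE not in kernels:
        # stand-in for the expensive warmup/compile step
        kernels[BLOCK_SIZE] = f"kernel(BLOCK_SIZE={BLOCK_SIZE}, num_stages={num_stages}, shape={shape})"
    return kernels[BLOCK_SIZE]

print(precompile(1024, num_stages=4, shape=(1823, 781)))
print(precompile(1024, num_stages=2, shape=(64, 781)))  # stale: still the num_stages=4 entry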

kernel = softmax_kernel.warmup(y, x, x.stride(0), y.stride(0), n_rows, n_cols, BLOCK_SIZE=BLOCK_SIZE,
                               num_stages=num_stages, num_warps=num_warps, grid=(1, ))
occupancy = min(occupancy, SIZE_SMEM // size_smem)  # cap occupancy by shared-memory usage
num_programs = NUM_SM * occupancy
kernels[BLOCK_SIZE] = (kernel, num_programs)        # cache keyed only on BLOCK_SIZE (removed by this PR)
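
For reference, a minimal sketch of what this step could look like with the per-BLOCK_SIZE cache dropped, relying on Triton's own JIT cache to deduplicate repeated compilations. This is modeled on the tutorial, not the exact PR diff; NUM_REGS, WARP_SIZE, NUM_SM, SIZE_SMEM, num_warps and num_stages are assumed to be defined as in the surrounding tutorial code.

# Sketch only: precompile unconditionally (same warmup call as above) and let
# Triton's own JIT cache handle reuse instead of the kernels[BLOCK_SIZE] dict.
kernel = softmax_kernel.warmup(y, x, x.stride(0), y.stride(0), n_rows, n_cols, BLOCK_SIZE=BLOCK_SIZE,
                               num_stages=num_stages, num_warps=num_warps, grid=(1, ))
kernel._init_handles()
n_regs = kernel.n_regs              # registers per thread for this specialization
size_smem = kernel.metadata.shared  # shared memory (bytes) for this specialization
occupancy = NUM_REGS // (n_regs * WARP_SIZE * num_warps)
occupancy = min(occupancy, SIZE_SMEM // size_smem)
num_programs = min(NUM_SM * occupancy, n_rows)
kernel[(num_programs, 1, 1)](y, x, x.stride(0), y.stride(0), n_rows, n_cols)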
Contributor commented:

I think there is still a bug in this warmup-based precompilation, because warmup uses MockTensor:

return self.run(grid=grid, warmup=True, *map(MockTensor.wrap_dtype, args), **kwargs)

And MockTensor doesn't respect the real pointer's alignment:

@staticmethod
def data_ptr():
    return 0  # optimistically assumes multiple of 16

This seems questionable though. I'm not sure why warmup couldn't just operate on the real tensors since it doesn't actually run any device code.
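
As a small, hypothetical illustration of why this matters (not code from the PR): Triton specializes pointer arguments on 16-byte divisibility, and a data_ptr() of 0 always passes that check, while a real tensor, e.g. a sliced view, may not.

import torch

def looks_16_byte_aligned(t: torch.Tensor) -> bool:
    # The divisibility check that MockTensor's data_ptr() == 0 always satisfies.
    return t.data_ptr() % 16 == 0

x = torch.randn(128, 128)  # fresh allocations are normally 16-byte aligned
view = x[:, 1:]            # offset by one float32 (4 bytes) from x's base pointer
print(looks_16_byte_aligned(x), looks_16_byte_aligned(view))  # typically: True False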

Mogball (author) replied:

Nice catch! Let me try digging into this.
