[TIR][Schedule] affine binding and more #432
Comments
I agree that clarifying part of the iterator space as affine might be something we want to think a bit more about.
Some additional thoughts on correctness.

Affine binding is not exact affine

# Cannot parallel i
for i in tir.grid(8):
    with tir.block([8]) as [vi]:
        tir.write(A[vi : vi + 2])
        A[vi] = vi
        A[vi + 1] = vi

Each block instance writes the overlapping region A[vi : vi + 2], so parallelizing i would let neighboring instances write to the same elements concurrently.

# Cannot reorder i, j
for i, j in tir.grid(8, 8):
    with tir.block([8, 8]) as [vi, vj]:
        tir.bind(vi, i)
        tir.bind(vj, j)
        A[vi, vj] = B[vi - 1, vj + 1]
        B[vi, vj] = 1

The read of B[vi - 1, vj + 1] sees a value that an earlier iteration has already overwritten with 1; reordering i and j changes which elements of B have been overwritten at the time of each read, so the result changes.

Block isolation is not strong enough

We design the wmma block as follows:
with tir.block([1, 1]):
    wmma.sync(A, B, C)

TensorCore is expected to be a warp-level operation: it requires 32 threads (a full warp along threadIdx) working together. However, by looking at the block signature, we cannot know this constraint, and the missing information may affect further scheduling. This does not only happen with TensorCore; most opaque intrinsics have their own unique constraints. Another example:
# Cannot reorder i, k
for i, k in tir.grid(...):
    with tir.block([8, tir.reduce(8)]) as [vi, vk]:
        tir.bind(vi, i)
        tir.bind(vk, k)
        A[vi] += B[vi, vk]
        C[vi] = A[vi]

Some thoughts
Great point, one thing we need to keep in mind is our use case. For example, it would be great to have clear block isolation for wmma, because this seems to be the key need for our major use case. We can, however, tighten up some of the corner cases as long as we have the things we need in the search space.
Recently @Hzfengsy brought up a question regarding affine binding and related schedule primitives.
After some brief discussion, I am putting my thoughts here for further discussion.
Intro case
The intro case is a simplified pooling operator. After a few schedule transformations, the cache read block's binding is not affine, yet we may still want to apply further schedule primitives to it, which brings problems to the current schedule transformations.
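As a rough illustration (the buffer names, extents, and the exact schedule here are assumptions, not the original example), a sum-pooling-like block before and after caching its input:

# a sum-pooling-like consumer with window 2
for i, di in tir.grid(8, 2):
    with tir.block([8, tir.reduce(2)]) as [vi, vdi]:
        tir.bind(vi, i)
        tir.bind(vdi, di)
        B[vi] += A[vi + vdi]

# after a cache read of A is placed under loop i, the cache block's
# binding depends on both i and the new inner loop ax0
for i in tir.grid(8):
    for ax0 in tir.grid(2):
        with tir.block([9]) as [v]:
            tir.bind(v, i + ax0)
            A_cache[v] = A[v]
    for di in tir.grid(2):
        with tir.block([8, tir.reduce(2)]) as [vi, vdi]:
            tir.bind(vi, i)
            tir.bind(vdi, di)
            B[vi] += A_cache[vi + vdi]

The cache block's binding v = i + ax0 maps overlapping (i, ax0) pairs to the same value of v, so it is not an affine binding in the sense the schedule primitives expect.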
Affine binding and parallelization
A clear motivation for affine binding is shown below
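A rough reconstruction of the kind of block in question (buffer names and extents are assumptions): a complete block whose binding maps different loop iterations to the same block var value, similar to the cache read sketch above.

# complete block, but the binding vi = io + ii is not a bijective map of the loops
for io, ii in tir.grid(8, 3):
    with tir.block([10]) as [vi]:
        tir.bind(vi, io + ii)
        tir.write(C[vi])
        C[vi] = A[vi]

For example, the instances with (io = 0, ii = 2) and (io = 2, ii = 0) both write C[2], so running io in parallel makes those writes race.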
The block above is indeed a complete block, but it is incorrect to parallel(io).
The current parallel algorithm doesn't reject such cases.
If we add C to its read buffers, then the complete block check will work. But I still think it is worth discussing whether it is OK to do so.
Affine binding and reordering
It's incorrect to reorder k, io, ii.
Affine binding and blockization
Previously, I implemented subspace division for affine bindings to do blockization, since we need to generate reasonable bindings for the resulting outer and inner blocks.
If we face a tensorization need like the wmma case above, affine binding doesn't work anymore.
A somewhat ad-hoc fix I can come up with is to generalize subspace division to work under vi = affine(outer loops) + some_inner_loop.
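For concreteness, a minimal sketch (buffer names and extents are assumptions) of the binding shape such a generalization would need to handle, where the block var is an affine expression of the outer loop plus a raw inner loop var:

# vi = io * 2 + ii: affine in the outer loop io, plus the raw inner loop ii
for io in tir.grid(8):
    for ii in tir.grid(16):
        with tir.block([30]) as [vi]:
            tir.bind(vi, io * 2 + ii)
            A_cache[vi] = A[vi]

Blockizing at loop ii would then require the inner block's binding to carry the io * 2 offset contributed by the outer loop, which is exactly the vi = affine(outer loops) + some_inner_loop form.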
Would be great to hear your opinions.
cc @tqchen @Hzfengsy @junrushao1994 @MasterJH5574 @jinhongyii @yzh119