[TIR][Schedule] affine binding and more #432

Open
spectrometerHBH opened this issue Aug 9, 2021 · 3 comments

Comments

@spectrometerHBH
Collaborator

spectrometerHBH commented Aug 9, 2021

Recently @Hzfengsy brought up a question regarding affine binding and the related schedule primitives.
After a brief discussion, I am putting my thoughts here for further discussion.

Intro case

for i in 16:
    with tir.block([16]) as vi:
        tir.bind(vi, i)
        A[vi] = tir.max(B[vi], B[vi + 1])

The above code is a simplified pooling operator. If we apply the following schedule transformations, we get the IR below (a sketch of the corresponding schedule calls follows the transformed code):

# split
# cache_read
for i in 17:
    with tir.block([17]) as vi:
        tir.bind(vi, i)
        B_shared[vi] = B[vi]

for io in 4:
    for ii in 4:
        with tir.block([16]) as vi:
            tir.bind(vi, io * 4 + ii)
            A[vi] = tir.max(B_shared[vi], B_shared[vi + 1])

# compute_at
for io in 4:
    for ii in 5:
        with tir.block([17]) as vi:
            tir.bind(vi, io * 4 + ii)
            B_shared[vi] = B[vi]
    for ii in 4:
        with tir.block([16]) as vi:
            tir.bind(vi, io * 4 + ii)
            A[vi] = tir.max(B_shared[vi], B_shared[vi + 1])
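
For reference, here is a minimal sketch of schedule calls that could produce the transformation above, assuming the TensorIR schedule API; the block name "pool" and the exact call sequence are illustrative assumptions, not taken from the original experiment.

sch = tir.Schedule(mod)
pool = sch.get_block("pool")                # the pooling block from the intro case
i, = sch.get_loops(pool)
io, ii = sch.split(i, factors=[4, 4])       # split
cache = sch.cache_read(pool, 0, "shared")   # cache_read: stage B into B_shared
sch.compute_at(cache, io)                   # compute_at: move the cache block under io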

Note that after compute_at the cache read block's binding is no longer affine: vi = io * 4 + ii with ii ranging over 5, so different loop iterations map to the same vi. But we may still want to

  1. parallelize the outer loop io
  2. tensorize the cache read block

Both of these cause problems for the current schedule transformations.

Affine binding and parallelization

A clear motivation for affine binding is shown below

for io in 4:
    for ii in 5:
        with tir.block([17]) as vi:
            tir.bind(vi, io * 4 + ii)
            C[vi] = B[vi]
            C[vi] = C[vi] + 1

The block above is indeed a complete block, but it is incorrect to parallel(io).
The current parallel algorithm doesn't reject such cases.
If we add C to its read buffers, the complete-block check will catch it, but I still think it is worth discussing whether doing so is the right fix.
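
Concretely, the binding vi = io * 4 + ii with ii in 5 overlaps across io iterations:

# io = 0, ii = 4  ->  vi = 4
# io = 1, ii = 0  ->  vi = 4

Both instances write C[4], and the second statement reads C[vi] back, so running io in parallel creates a race on C[4] and makes the result order-dependent.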

Affine binding and reordering

for io in 4:
    for ii in 5:
        for k in 10:
            with tir.block([17, tir.reduce(10)]) as vi, vk:
                tir.bind(vi, io * 4 + ii)
                tir.bind(vk, k)
                with tir.init():
                    C[vi] = 0.0
                C[vi] += A[vi, vk]

It's incorrect to reorder k, io, ii.
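
To see why: with the overlapping binding, vi = 4 is produced by both (io = 0, ii = 4) and (io = 1, ii = 0). In the original order the duplicate instance simply redoes the entire reduction for C[4] (re-running the init at k = 0 and then the whole k loop), so the final value is still the correct sum. After reordering k outermost, the init only fires at k = 0, and for every k both instances accumulate A[4, k], so C[4] double-counts most of the A[4, k] terms.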

Affine binding and blockization

for io in 4:
    for ii in 5:
        with tir.block([17]) as vi:
            tir.bind(vi, io * 4 + ii)
            B_shared[vi] = B[vi]

Previously, I implemented subspace division for affine bindings to do blockization, since we need to generate reasonable bindings for the resulting outer and inner blocks.
For the tensorization need above, affine binding no longer works.

A somewhat ad-hoc fix I can come up with is to generalize subspace division to work under vi = affine(outer loops) + some_inner_loop.
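
To make this concrete, one possible shape of the blockized cache read under such a generalized division might be the following (purely illustrative; the outer block's signature is exactly the open question):

for io in 4:
    with tir.block([4]) as vio:
        tir.bind(vio, io)
        for ii in 5:
            with tir.block([17]) as vi:
                tir.bind(vi, vio * 4 + ii)
                B_shared[vi] = B[vi]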

It would be great to hear your opinions.

cc @tqchen @Hzfengsy @junrushao1994 @MasterJH5574 @jinhongyii @yzh119

@tqchen
Contributor

tqchen commented Aug 9, 2021

I agree that clarifying which part of the iterator space is affine might be something we want to think about a bit more.

@Hzfengsy
Member

Hzfengsy commented Aug 16, 2021

Some additional thoughts on correctness.

Affine binding alone does not guarantee correctness

# Cannot parallel i: iterations i and i + 1 both write A[i + 1]
for i in tir.grid(8):
    with tir.block([8]) as [vi]:
        tir.bind(vi, i)
        tir.writes(A[vi : vi + 2])
        A[vi] = vi
        A[vi + 1] = vi
# Cannot reorder i, j: block (vi, vj) reads B[vi - 1, vj + 1], which is written by
# block (vi - 1, vj + 1); after the reorder the reader would run before the writer
for i, j in tir.grid(8, 8):
    with tir.block([8, 8]) as [vi, vj]:
        tir.bind(vi, i)
        tir.bind(vj, j)
        A[vi, vj] = B[vi - 1, vj + 1]
        B[vi, vj] = 1

Block isolation is not strong enough

We designed Block and expect it to be a level of isolation, but it may not work so well. Here are some cases:

  1. Tensorcore intrinsic
with tir.block([1, 1]):
    wmma.sync(A, B, C)

TensorCore is expected to be a warp-level operation: it requires 32 threads (a full warp of threadIdx) working together. However, by looking at the block signature, we cannot see this constraint, and the missing information may affect further scheduling.

This is not limited to TensorCore; most opaque intrinsics have their own unique constraints.

  2. Another scalar case
# Cannot reorder i, k
for i, k in tir.grid(...):
    with tir.block([8, tir.reduce(8)]) as [vi, vk]:
        tir.bind(vi, i)
        tir.bind(vk, k)
        A[vi] += B[vi, vk]
        C[vi] = A[vi]

Some thoughts

  1. Why do other black-box kernels (e.g., cuBLAS, cuDNN) work well?
    They cannot be parallelized, and there are many libraries that cannot run in parallel within a single process.
  2. Should we take responsibility for all such corner cases?
    It's always a trade-off between correctness and schedule space, but I don't know where the right middle ground is.
  3. What will happen during auto-tuning / auto-scheduling if we cannot prove correctness?
    Something terrible.
  4. Create a new mode that turns off some or all of the checks, while warning users that it allows dangerous behavior.
  5. Introduce some checks for iter type. At the very least, users need to know what data_par means and not use it without thinking carefully (see the sketch after this list).
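
As a rough illustration of item 5, an iter-type-aware check before parallel might look like the pseudocode below. This is not existing TVM code; the field names and the helpers depends_on and is_non_overlapping_affine are assumptions for illustration.

def can_parallel(loop, block):
    # Reject parallelization unless every block var that depends on the loop
    # is data-parallel and its binding is a non-overlapping affine map.
    for iter_var, binding in zip(block.iter_vars, block.bindings):
        if depends_on(binding, loop.loop_var):            # hypothetical helper
            if iter_var.iter_type != "data_par":
                return False
            if not is_non_overlapping_affine(binding):    # would reject io * 4 + ii with ii in 5
                return False
    return True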

@tqchen
Contributor

tqchen commented Aug 16, 2021

Great point. One thing we need to keep in mind is our use case. For example, it would be great to have clear block isolation for wmma, because this seems to be the key need for our major use case.

We can, however, tighten up some of the corner cases as long as we keep the things we need in the search space.
