forked from pytorch/executorch
Add Shared Memory Bandwidth Profiler (pytorch#4277)
Summary:
Pull Request resolved: pytorch#4277

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from shared memory, using the following shader, where A is a shared buffer and B is a writeonly buffer.

    shared vec4 A[nvec];

    void main() {
      vec4 sum = vec4(0);
      const uint workgroup_width = local_group_size * niter * ${NUNROLL};
      uint offset = (gl_WorkGroupID[0] * workgroup_width + gl_LocalInvocationID[0]) & addr_mask;

      int i = 0;
      for (; i < niter; ++i) {
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
      }

      vec4 zero = vec4(i>>31);
      B[gl_LocalInvocationID[0]] = sum + zero;
    }

The address mask controls how many unique addresses are accessed. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained from the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this. We can see that accessing the shared memory has a constant latency until the data size reaches the maximum shared memory size.

NOTE: The graph is extended for visualization purposes; the experiment stops before the drop, because it would otherwise crash.

{F1759597657}

Comparing it to OpenCL, we can observe that, although the behavior is the same, Vulkan achieves higher bandwidth.

{F1759600867}

Reviewed By: copyrightly

Differential Revision: D59811152

fbshipit-source-id: 537be13dbec1a02cb55e689db2a0fd548613c729
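To make the address-mask behavior described in the summary concrete, the following is a minimal CPU-side sketch that replays the shader's offset arithmetic in Python. It is not part of the commit. The helper names `visited_offsets` and `unique_addresses` are hypothetical, and the sketch assumes that `addr_mask` equals `nvec - 1` with `nvec` a power of two, which the summary does not state.

```python
def visited_offsets(workgroup_id, local_id, local_group_size, niter, nunroll, addr_mask):
    """Replay the shader's offset arithmetic for one invocation (hypothetical helper)."""
    workgroup_width = local_group_size * niter * nunroll
    offset = (workgroup_id * workgroup_width + local_id) & addr_mask
    seen = []
    # The shader body repeats the read NUNROLL times per loop iteration.
    for _ in range(niter * nunroll):
        seen.append(offset)
        offset = (offset + local_group_size) & addr_mask
    return seen


def unique_addresses(num_workgroups, local_group_size, niter, nunroll, addr_mask):
    """Collect every offset touched by all invocations of all workgroups."""
    touched = set()
    for wg in range(num_workgroups):
        for lid in range(local_group_size):
            touched.update(
                visited_offsets(wg, lid, local_group_size, niter, nunroll, addr_mask)
            )
    return touched


if __name__ == "__main__":
    # Assumption: addr_mask = nvec - 1, with nvec a power of two.
    for nvec in (2, 16):
        touched = unique_addresses(
            num_workgroups=2, local_group_size=4, niter=2, nunroll=4, addr_mask=nvec - 1
        )
        print(f"nvec={nvec}: {len(touched)} unique offsets")
```

Under these assumptions, a small mask keeps every invocation on the same few offsets, while a larger mask lets each invocation stride through its own subset of the buffer, which is the knob the profiler varies to measure bandwidth per data size.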
1 parent a4decca, commit e5687a4
Showing 3 changed files with 43 additions and 27 deletions.
    @@ -13,5 +13,6 @@ buf_bandwidth:
         MEMTYPE:
           - VALUE: ubo
           - VALUE: buffer
    +      - VALUE: shared
       shader_variants:
         - NAME: buf_bandwidth