AMOS is an interesting approach that uses automatic mapping generation and performance optimization to better utilize emerging hardware units such as TensorCore. I have run into some questions while studying the implementation and would appreciate some guidance.
When computing the compute latency, the intrinsic latency is a fixed value that can be approximated with a hardware model. This latency is then multiplied by the trip counts of the sequential loops, i.e., the loops that run serially rather than being bound to parallel cores. Why are these loops treated as sequential?
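To make sure I understand the model, here is a minimal sketch of what I think the latency estimate looks like (the function and parameter names are my own, not AMOS's actual API): the fixed intrinsic latency is scaled by the product of the trip counts of the loops that remain serial.

```python
from functools import reduce

def estimate_compute_latency(intrinsic_latency, sequential_trip_counts):
    """Hypothetical latency model sketch.

    intrinsic_latency: latency of one hardware intrinsic call (e.g. a
        16x16x16 TensorCore MMA), approximated from a hardware model.
    sequential_trip_counts: trip counts of the loops that execute serially
        because they are not mapped to parallel cores.
    """
    serial_iterations = reduce(lambda a, b: a * b, sequential_trip_counts, 1)
    return intrinsic_latency * serial_iterations

# Example: one intrinsic takes ~32 cycles and two loops (trip counts 8 and 4)
# stay serial, giving 32 * 8 * 4 = 1024 cycles.
print(estimate_compute_latency(32, [8, 4]))
```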
Scheduling operations such as tiling and fusion are typically applied before tensorization and produce parallel code. These scheduling decisions can also change the number of software iterations. How is this variation handled, and how effective is the mapping generation process in practice?
When there are not enough cores for parallel execution, some loops still remain sequential. This is common in tensor computation.
Mapping takes three steps: compute transform, scheduling, and tensorization. The compute transform changes the compute expressions according to the hardware intrinsic, scheduling mutates the loop structure, and tensorization replaces the innermost loops with the intrinsic. To make sure tensorization is not affected by scheduling, we perform a pre-tiling step during the compute transform to keep a fixed number of iterations as the innermost loops. For example, a GEMM:
```
for i
  for j
    for k
      ...
```
will be transformed into
```
for io
  for jo
    for ko
      for ii in range 16
        for ji in range 16
          for ki in range 16
            ...
```
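As a rough illustration of the pre-tiling idea (this is a sketch I wrote, not AMOS's real code, and the intrinsic shape of 16x16x16 is just the TensorCore example above): each loop extent is split into an outer part and a fixed inner part matching the intrinsic shape, so tensorization can always pick up the innermost loops regardless of how scheduling later transforms the outer ones.

```python
# Hypothetical sketch of pre-tiling during the compute transform step.
INTRIN_SHAPE = {"i": 16, "j": 16, "k": 16}  # e.g. a 16x16x16 TensorCore MMA

def pre_tile(extents):
    """Split each loop into (outer, inner) parts.

    extents: mapping from loop name to its original trip count.
    Returns (outer_extents, inner_extents); the inner extents are fixed to
    the intrinsic shape so later scheduling cannot disturb them.
    """
    outer, inner = {}, {}
    for name, extent in extents.items():
        tile = INTRIN_SHAPE[name]
        assert extent % tile == 0, "padding would be needed otherwise"
        outer[name + "o"] = extent // tile
        inner[name + "i"] = tile
    return outer, inner

# A 1024x1024x1024 GEMM becomes 64x64x64 outer loops over 16x16x16 tiles.
print(pre_tile({"i": 1024, "j": 1024, "k": 1024}))
```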
As for the efficacy, mapping generation is fast (several seconds) but performance tuning is slow (tens of minutes).