
Some Qs about implementation #8

Open
lylOwhd opened this issue Feb 21, 2024 · 1 comment

Comments


lylOwhd commented Feb 21, 2024

AMOS is an innovative approach that uses automated mapping generation and performance optimization to improve the utilization of emerging hardware units such as TensorCore. I have run into some implementation questions and would appreciate guidance.

  1. When computing the compute latency, the intrinsic latency is a fixed value that can be approximated from a hardware model. This latency is then multiplied by the trip counts of the sequential loops, i.e., the loops that are not bound to parallel cores. Why is this multiplication by the sequential trip counts necessary? (A small sketch of my reading of this model follows after this list.)
  2. Scheduling operations such as tiling and fusion usually happen before tensorization and produce parallel code, and different schedules can change the number of software iterations. How is this variation handled, and how effective is the mapping generation process in practice?
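
For concreteness, here is a minimal Python sketch of how I currently read the latency model in question 1; the function and parameter names (estimate_compute_latency, intrinsic_latency, sequential_trip_counts) are my own and not from the AMOS code:

from math import prod

def estimate_compute_latency(intrinsic_latency, sequential_trip_counts):
    # The intrinsic latency is a fixed per-invocation cost (approximated
    # from a hardware model); sequential loops execute their bodies one
    # after another, so their trip counts multiply that cost, whereas
    # loops bound to parallel cores do not contribute a factor.
    return intrinsic_latency * prod(sequential_trip_counts)

# Example: an intrinsic of fixed cost 1.0 issued inside sequential loops
# of sizes 8, 4, and 32 -> estimated latency 1024.0.
print(estimate_compute_latency(1.0, [8, 4, 32]))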
@KnowingNothing
Collaborator

  1. When there are not enough cores for parallel execution, some loops remain sequential. This is common in tensor computation.
  2. Mapping takes three steps: compute transform, scheduling, and tensorization. The compute transform rewrites the compute expressions according to the hardware intrinsic, scheduling mutates the loop structure, and tensorization replaces the innermost loops with the intrinsic. To make sure tensorization is not affected by scheduling, we perform a pre-tiling step during the compute transform that keeps a fixed number of iterations as the innermost loops (a small sketch at the end of this comment illustrates this). For example, a GEMM:
for i 
 for j 
  for k
   ...

will be transformed into

for io
 for jo
  for ko
   for ii in range 16
    for ji in range 16
     for ki in range 16
      ...

As for the efficacy, mapping generation is fast (several seconds) but performance tuning is slow (tens of minutes).
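
A minimal sketch of this idea, using numpy's matmul as a stand-in for the hardware intrinsic (the function names here are illustrative, not the AMOS API): the outer io/jo/ko loops are free for scheduling to split, fuse, or bind to parallel cores, while the fixed 16x16x16 innermost block is what tensorization replaces with the intrinsic.

import numpy as np

TI = TJ = TK = 16  # fixed innermost extents kept by the pre-tiling step

def mma_16x16x16(c_tile, a_tile, b_tile):
    # Stand-in for the hardware intrinsic (e.g. a TensorCore MMA);
    # tensorization replaces the innermost 16x16x16 loops with this call.
    c_tile += a_tile @ b_tile

def gemm_pre_tiled(A, B):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    # Outer loops: scheduling (tiling, fusion, parallel binding) may
    # reorder or split these freely without touching the innermost block.
    for io in range(M // TI):
        for jo in range(N // TJ):
            for ko in range(K // TK):
                mma_16x16x16(
                    C[io*TI:(io+1)*TI, jo*TJ:(jo+1)*TJ],
                    A[io*TI:(io+1)*TI, ko*TK:(ko+1)*TK],
                    B[ko*TK:(ko+1)*TK, jo*TJ:(jo+1)*TJ],
                )
    return C

# Quick check against a reference matmul.
A = np.random.rand(64, 128).astype(np.float32)
B = np.random.rand(128, 32).astype(np.float32)
assert np.allclose(gemm_pre_tiled(A, B), A @ B, atol=1e-3)

Because the innermost 16x16x16 block is fixed by the compute transform, any re-scheduling of the outer loops leaves the tensorized region untouched.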
