demo build: https://advanced-micro-devices-demo--446.com.readthedocs.build/projects/omniperf/en/446/
Performance model
Pipeline descriptions
VALU
AGPRs
Pipeline metrics
L1
UTCL1
TA instruction counts
Scalar / Instruction cache
- 64KB / shared between CUs on MI300
L2
- Need to add a 128B read request line; still to be decided how to represent it on the diagram
- 16 channels per XCC, still 256B interleaved
- Likely more involved; need to write some tests to see what triggers these here
- [ ] 128B cache-line there as well
- All atomics are now counted as such on MI300, because they are not cached in L2 and must go to MALL
- Same with:
- HBM Write and Atomic Traffic
- Remote Write and Atomic Traffic
- Atomic Traffic
- Uncached Write and Atomic Traffic
- Need to add 128B read request metric to table
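To make the channel and request-size notes above concrete, here is a minimal sketch. The 16 channels per XCC, the 256B interleave and the 128B request size come from the notes. The plain modulo mapping is an assumption made for illustration (the real hardware address hash may differ), and the helper names `l2_channel` and `read_requests_128b` are hypothetical, not part of Omniperf.

```python
L2_CHANNELS_PER_XCC = 16   # from the notes above
INTERLEAVE_BYTES = 256     # from the notes above
READ_REQUEST_BYTES = 128   # new request size mentioned in the notes


def l2_channel(addr: int) -> int:
    """Map a byte address to an L2 channel index.

    ASSUMPTION: simple modulo interleaving on 256B granules; the actual
    hardware mapping is not described in the notes and may be hashed.
    """
    return (addr // INTERLEAVE_BYTES) % L2_CHANNELS_PER_XCC


def read_requests_128b(addr: int, size: int) -> int:
    """Count the 128B-aligned blocks touched by the range [addr, addr + size)."""
    if size <= 0:
        return 0
    first_block = addr // READ_REQUEST_BYTES
    last_block = (addr + size - 1) // READ_REQUEST_BYTES
    return last_block - first_block + 1
```

Under these assumptions, a 128B read that starts 64B into a block straddles two blocks, so `read_requests_128b(64, 128)` counts two requests, while an aligned 256B read also counts two and stays within a single channel.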
Memory type
New concepts
- [ ] Number of CUs depends on # of XCCs active in the current partitioning mode
- [ ] Number of HBM channels per partition (and thus: the achievable L2<->EA bandwidth) depends on the NPS mode
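The two dependencies above can be sketched as a small helper. This is only an illustration under stated assumptions: XCCs are split evenly across compute partitions, HBM channels are split evenly across NPS memory partitions, and the name `partition_view` and all of its parameters are hypothetical rather than an existing Omniperf API. The device totals are passed in by the caller instead of being hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionView:
    xccs: int          # XCCs visible to one compute partition
    cus: int           # CUs visible to one compute partition
    hbm_channels: int  # HBM channels reachable from one memory partition


def partition_view(total_xccs: int, cus_per_xcc: int, total_hbm_channels: int,
                   compute_partitions: int, nps: int) -> PartitionView:
    """Derive per-partition resources.

    ASSUMPTION: resources divide evenly across partitions; uneven splits
    are rejected rather than guessed at.
    """
    if compute_partitions <= 0 or total_xccs % compute_partitions:
        raise ValueError("XCCs must divide evenly across compute partitions")
    if nps <= 0 or total_hbm_channels % nps:
        raise ValueError("HBM channels must divide evenly across NPS partitions")
    xccs = total_xccs // compute_partitions
    return PartitionView(xccs=xccs,
                         cus=xccs * cus_per_xcc,
                         hbm_channels=total_hbm_channels // nps)
```

For example, with placeholder totals of 8 XCCs, 38 CUs per XCC and 128 HBM channels, `partition_view(8, 38, 128, compute_partitions=8, nps=4)` gives 1 XCC, 38 CUs and 32 HBM channels per partition. Any metric normalized by CU count or by peak L2<->EA bandwidth would need to use these per-partition values.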
References