[Issue]: Roctracer reports GPU Events Ending at Same Time Next Event starts #105
Comments
Summary: As reported in ROCm/roctracer#105, event start and end timestamps can "tie", which causes a visual issue in the traces. Let's add a tiny buffer so the events are separate. At the single-nanosecond level the timings are inaccurate anyway, so adding this buffer in the meantime does no real harm. Remove it or wrap it in an ifdef once the issue is resolved. Differential Revision: D63296093
Summary: Pull Request resolved: #992. As reported in ROCm/roctracer#105, event start and end timestamps can "tie", which causes a visual issue in the traces. Let's add a tiny buffer so the events are separate. At the single-nanosecond level the timings are inaccurate anyway, so adding this buffer in the meantime does no real harm. Remove it or wrap it in an ifdef once the issue is resolved. Reviewed By: aaronenyeshi. Differential Revision: D63296093. fbshipit-source-id: 09e313e55bbee65f5e6a4974dc52b3e0df4d5922
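For readers who want to see what such a buffer could look like, here is a minimal sketch in Python. It is not the code from the pull request: the function name `separate_ties`, the `(start_ns, end_ns)` tuple layout, the 1 ns default, and the choice to move the later event's start (rather than the earlier event's end) are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ns, end_ns); hypothetical layout


def separate_ties(intervals: List[Interval], buffer_ns: int = 1) -> List[Interval]:
    """Nudge an event's start forward when it ties with the previous end.

    Assumes `intervals` belong to one GPU queue and are sorted by start time.
    Only exact ties (start == previous end) are adjusted.
    """
    result: List[Interval] = []
    prev_end = None
    for start, end in intervals:
        if prev_end is not None and start == prev_end:
            start += buffer_ns
            # Keep the event well-formed if it was shorter than the buffer.
            end = max(end, start)
        result.append((start, end))
        prev_end = end
    return result
```

With this sketch, `separate_ties([(100, 200), (200, 300)])` leaves the first event untouched and moves the second event's start to 201, so a trace viewer no longer sees the two events touching.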
Hi @sraikund16. An internal ticket has been created to investigate your issue. Thanks!
Hi @sraikund16, thanks for reporting the issue. Working with the internal team, we found issues with event timestamps in rocprof on MI300X GPUs. We isolated the cause to a firmware issue with the raw timestamps reported on MI300X when spawning kernels, and we are currently working on a resolution. I will keep you updated on any progress, thanks!
Problem Description
We notice that many of the events reported by Roctracer for a single GPU and a single queue "tie": the first event ends at the exact same nanosecond the second one starts. This is a fairly innocuous bug, but it can skew kernel metrics if the times are not being reported correctly. Ideally there would be a small buffer of nanoseconds between one event's end and the next event's start.
This seems to be a different problem from #104, as it appears to be an issue with timestamp granularity rather than mismatched timings.
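Since no reproduction steps were provided, the following Python sketch shows one way the ties described above could be detected in an exported list of events. The `GpuEvent` record and its field names are hypothetical and do not correspond to any Roctracer data structure; the events are assumed to come from a single GPU and a single queue.

```python
from typing import List, NamedTuple, Tuple


class GpuEvent(NamedTuple):
    # Hypothetical record; field names are illustrative only.
    name: str
    start_ns: int
    end_ns: int


def find_ties(events: List[GpuEvent]) -> List[Tuple[GpuEvent, GpuEvent]]:
    """Return consecutive event pairs where one ends exactly when the next starts."""
    ordered = sorted(events, key=lambda e: (e.start_ns, e.end_ns))
    return [
        (prev, nxt)
        for prev, nxt in zip(ordered, ordered[1:])
        if prev.end_ns == nxt.start_ns
    ]
```

Counting the returned pairs against the total number of events gives a rough sense of how often the tie occurs in a given trace.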
Operating System
CentOS Stream 9
CPU
AMD EPYC 7713
GPU
AMD Instinct MI300X
ROCm Version
ROCm 6.2.0
ROCm Component
rocm-core, roctracer
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response