[Issue]: Roctracer reports GPU Events Ending at the Same Time the Next Event Starts #105

Open
sraikund16 opened this issue Sep 23, 2024 · 2 comments

Comments

@sraikund16

Problem Description

We notice that many of the events Roctracer reports for a single GPU and a single queue "tie": one event ends at the exact nanosecond the next one starts. The bug is fairly innocuous, but it can skew kernel metrics if the times are not being reported correctly. Ideally there would be a gap of at least a few nanoseconds between one event's end and the next event's start.

This appears to be a different problem from #104: it looks like a timestamp granularity issue rather than mismatched timings.
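
For anyone who wants to check their own traces, the tie described above could be detected with something like the following. This is a minimal sketch, not Roctracer's API: the `GpuEvent` record and the `find_ties` helper are hypothetical names, and it assumes the events for one GPU and one queue have already been exported as (name, start, end) values in nanoseconds.

```python
from typing import List, NamedTuple, Tuple


class GpuEvent(NamedTuple):
    # Hypothetical record for one GPU event; not a Roctracer type.
    name: str
    start_ns: int
    end_ns: int


def find_ties(events: List[GpuEvent]) -> List[Tuple[GpuEvent, GpuEvent]]:
    """Return adjacent pairs where one event ends exactly when the next starts."""
    ordered = sorted(events, key=lambda e: (e.start_ns, e.end_ns))
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if a.end_ns == b.start_ns]


# Made-up timestamps for illustration only.
events = [
    GpuEvent("kernel_a", 1000, 1500),
    GpuEvent("kernel_b", 1500, 1900),  # starts on the same nanosecond kernel_a ends
    GpuEvent("kernel_c", 1950, 2400),
]
ties = find_ties(events)
for first, second in ties:
    print(f"{first.name} ends at {first.end_ns} ns, where {second.name} starts")
```

A high share of tied pairs on a single queue would be consistent with the granularity problem reported here.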

Operating System

CentOS Stream 9

CPU

AMD EPYC 7713

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.2.0

ROCm Component

rocm-core, roctracer

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

sraikund16 added a commit to sraikund16/kineto that referenced this issue Sep 23, 2024
Summary: As reported in ROCm/roctracer#105, there is an issue where event starts and ends can "tie". This can cause a visual issue in the traces. Let's add a tiny buffer so the events are separate. At the single-nanosecond level the timings are inaccurate anyway, so it doesn't really hurt to add this buffer in the meantime. Remove/wrap in ifdef once the issue is resolved.

Differential Revision: D63296093
facebook-github-bot pushed a commit to pytorch/kineto that referenced this issue Sep 25, 2024
Summary:
Pull Request resolved: #992

As reported in ROCm/roctracer#105, there is an issue where event starts and ends can "tie". This can cause a visual issue in the traces. Let's add a tiny buffer so the events are separate. At the single-nanosecond level the timings are inaccurate anyway, so it doesn't really hurt to add this buffer in the meantime. Remove/wrap in ifdef once the issue is resolved.

Reviewed By: aaronenyeshi

Differential Revision: D63296093

fbshipit-source-id: 09e313e55bbee65f5e6a4974dc52b3e0df4d5922
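
The "tiny buffer" idea in the commit summary above can be sketched roughly as follows. This is not the actual Kineto change, which is not shown in this thread: the `Span` record, the `separate_ties` helper, the 1 ns default, and the choice to shorten the earlier event (rather than delay the later one) are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    # Hypothetical trace record; not a Kineto or Roctracer type.
    name: str
    start_ns: int
    end_ns: int


def separate_ties(spans: List[Span], buffer_ns: int = 1) -> List[Span]:
    """Shorten any span that ends exactly where the next one starts."""
    ordered = sorted(spans, key=lambda s: (s.start_ns, s.end_ns))
    fixed = []
    for current, following in zip(ordered, ordered[1:] + [None]):
        end_ns = current.end_ns
        if following is not None and end_ns == following.start_ns:
            # Pull the end back, but never before the span's own start.
            end_ns = max(current.start_ns, end_ns - buffer_ns)
        fixed.append(current._replace(end_ns=end_ns))
    return fixed


# Made-up timestamps for illustration only.
fixed = separate_ties([
    Span("kernel_a", 1000, 1500),
    Span("kernel_b", 1500, 1900),
    Span("kernel_c", 1950, 2400),
])
```

In this sketch a zero-length span that ties with its successor is left unchanged, since shortening it would put its end before its start.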
@ppanchad-amd

Hi @sraikund16. An internal ticket has been created to investigate your issue. Thanks!

@darren-amd

Hi @sraikund16,

Thanks for reporting the issue. Working with the internal team, we found problems with event timestamps in rocprof on MI300X GPUs. We isolated them to a firmware issue affecting the raw timestamps reported on MI300X when spawning kernels, and we are currently working on a resolution. I will keep you updated on any progress. Thanks!
