[Issue]: Roctracer GPU Events Have Overlapping Intervals #104
Comments
Since the overlap is so small, I suspect there could be a rounding issue going on?
Here is another print with the queue ids outputted:
Hi @sraikund16. Internal ticket has been created to investigate your issue. Thanks!
Hi @sraikund16, I was not able to build your branch locally on a 7900 XTX. Could you let me know what build steps you are following (including any environment variables you have set) as well as how you are running your example? This should help me reproduce the issue and help further, thanks!
Hello, you can build off of main on PyTorch and run a basic training job to reproduce this issue. My branch just adds debug output to show that there are overlapping intervals in the raw output of roctracer. As mentioned in the description of this post, I found that certain events appear to overlap more frequently than others, so it might be best to induce those. Thanks!
Hi @sraikund16, this appears to be a similar issue to #105, which we are currently working towards a fix for. Please let me know if you have any concerns, thanks!
Problem Description
When running a very small ResNet50 model, I am seeing GPU events on a single track (stream/queue) with overlapping time intervals. I see this most commonly in very specific kernels such as MIOpenBatchNormBwdSpatial and batched_transpose_32x32_dword, which have kind=0x11F0 and op=0. To investigate further, I created a debug branch to see what roctracer was returning before kineto does any processing: https://github.com/pytorch/kineto/pull/990/files
In this branch I have a debug check that triggers several messages similar to the following:
Out of order activity: 1886121463888334 < 1886121463888361. Difference: 27 ns. Kernel: batched_transpose_32x32_dword last Kernel: MIOpenBatchNormFwdTrainSpatialNorml
which suggests that the intervals overlap. In this branch I only check for overlapping events among events whose kind is not unknown, but there are many overlaps among the unknown-kind events as well.
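For anyone who wants to run the same kind of check on their own trace data, here is a minimal sketch in Python. It is not the code from the debug branch: the `GpuEvent` type and `find_overlaps` function are hypothetical names, and it assumes each event carries a queue id plus start and end timestamps in nanoseconds. The two timestamps in the example are taken from the message above, reading the first number as the current kernel's start and the second as the previous kernel's end; the other two timestamps are made up to complete the intervals.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class GpuEvent:
    # Hypothetical record; field names are illustrative, not roctracer's.
    name: str
    queue_id: int
    start_ns: int
    end_ns: int


def find_overlaps(events: List[GpuEvent]) -> List[Tuple[GpuEvent, GpuEvent, int]]:
    """Return (previous, current, overlap_ns) for events that start before
    the latest-ending earlier event on the same queue has finished."""
    latest_by_queue: Dict[int, GpuEvent] = {}
    overlaps: List[Tuple[GpuEvent, GpuEvent, int]] = []
    # Sort by queue, then by start time, so each queue is scanned in order.
    for ev in sorted(events, key=lambda e: (e.queue_id, e.start_ns)):
        prev = latest_by_queue.get(ev.queue_id)
        if prev is not None and ev.start_ns < prev.end_ns:
            overlaps.append((prev, ev, prev.end_ns - ev.start_ns))
        # Track the event with the latest end time seen so far on this queue.
        if prev is None or ev.end_ns > prev.end_ns:
            latest_by_queue[ev.queue_id] = ev
    return overlaps


# Example: end of the previous kernel and start of the current kernel come
# from the debug message; the other start/end values are invented.
events = [
    GpuEvent("MIOpenBatchNormFwdTrainSpatialNorml", 0,
             1886121463880000, 1886121463888361),
    GpuEvent("batched_transpose_32x32_dword", 0,
             1886121463888334, 1886121463895000),
]

for prev, cur, overlap_ns in find_overlaps(events):
    print(f"Out of order activity: {cur.start_ns} < {prev.end_ns}. "
          f"Difference: {overlap_ns} ns. Kernel: {cur.name} "
          f"last Kernel: {prev.name}")
```

Grouping by queue id matters here, since events on different queues are allowed to overlap; only overlaps within one queue are unexpected.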
Thanks!
Operating System
CentOS Stream 9
CPU
AMD EPYC 7713
GPU
AMD Instinct MI300X
ROCm Version
ROCm 6.2.0
ROCm Component
roctracer
Steps to Reproduce
Run a model that launches the kernels specified above and check whether their intervals overlap.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response