[Issue]: Memory copy started by dependency signal for pinned memory exhibits race condition #248
Comments
Just a thought that occurred to me: I don't think the bug is specific to the SDMA. It's just more easily reproduced that way. If I put a second GPU in the machine, it seems to take longer for the GPU to see the pinned memory - long enough to reproduce the problem with a kernel rather than an SDMA copy.
Hi @nstomlinson. Internal ticket has been created to investigate your issue. Thanks!
Thanks :) Please let me know if there's anything I can do to help.
Hi @nstomlinson, I couldn't repro the issue with any/no wait time. Can you send the core dump?
Hello
Thanks!
Thanks for the information. I can now reproduce the error with the libs that you provided. I'll look into it. Can you tell me how the .so files were built? I also noticed that you have
With ROCm 6.2.2 it should look like:
I suspect that it's a problem in the kernel driver. Since you are on Arch, which is not officially supported, kfd may have unexpected behaviours. Since you need to expose kfd to the docker container, I believe that's also why it fails on Ubuntu. Can you check if the problem is reproducible locally on Ubuntu?
I ran
Can you update ROCm on the host to 6.2.2 and check if the problem persists?
I did (by building the parts of ROCm userspace that it uses). It does still happen in that case. I haven't tested bare metal Ubuntu yet though.
You can also try 6.2.1 via AUR, see https://aur.archlinux.org/packages/opencl-amd. Unfortunately, Arch is not officially supported so I recommend you use any of the supported OS listed here if possible.
Closing due to lack of recent activity. Please update the issue if it persists and I'll reopen it.
Has a kernel patch been submitted etc., or was this issue just closed due to lack of activity? If no action has been taken to fix it, then it does indeed persist! A reproducer has been provided in the initial report. Does this work for you?
@zichguan-amd please confirm if this was fixed.
Problem Description
Scheduling a memory copy with hsa_amd_memory_async_copy and using a dependency signal to start the copy after hsaKmtMapMemoryToGPUNodes has returned seems to result in a race condition. See the attached tarball. Reproduction instructions and other information are in README.md. The exhibiting program is in main.cpp.
2024100201-pincopy.tar.gz
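For readers without the tarball, the ordering that the report relies on can be sketched as a host-only model. This is not the reproducer and it calls no HSA or KFD functions: ModelSignal and run_model_copy are hypothetical names, a std::thread stands in for the copy engine, and filling the source buffer stands in for the mapping step. It only illustrates the expectation that a copy gated on a dependency signal never observes memory before the host has finished preparing it.

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical stand-in for a signal: waiters block until the value reaches 0.
struct ModelSignal {
    std::atomic<long> value;
    explicit ModelSignal(long initial) : value(initial) {}
    void store_release(long v) { value.store(v, std::memory_order_release); }
    void wait_until_zero() const {
        while (value.load(std::memory_order_acquire) != 0) {
            std::this_thread::yield();
        }
    }
};

// Models "schedule a copy that starts on a dependency signal".
// Returns true if the destination matches the source after completion.
bool run_model_copy(std::size_t n) {
    std::vector<int> src(n, 0);
    std::vector<int> dst(n, -1);
    ModelSignal dependency(1);  // copy must not start while this is non-zero
    ModelSignal completion(1);  // set to 0 by the engine when the copy is done

    // The "copy engine": scheduled up front, but gated on the dependency.
    std::thread engine([&] {
        dependency.wait_until_zero();
        std::memcpy(dst.data(), src.data(), n * sizeof(int));
        completion.store_release(0);
    });

    // Host side: prepare the source (assumption: this stands in for the
    // mapping call returning), and only then release the dependency.
    for (std::size_t i = 0; i < n; ++i) {
        src[i] = static_cast<int>(i);
    }
    dependency.store_release(0);

    completion.wait_until_zero();
    engine.join();
    return dst == src;
}
```

In this model the release store on the dependency and the acquire load in the engine guarantee that the copy sees the prepared data. The behaviour described in this issue is what it would look like if that guarantee did not hold between the mapping call and the device.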
Operating System
Archlinux
CPU
AMD Ryzen Threadripper 1950X
GPU
AMD Radeon Pro W6800
ROCm Version
ROCm 6.2.2
ROCm Component
ROCK-Kernel-Driver
Steps to Reproduce
Unzip the attached tarball, and follow the instructions in README.md.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
Additional Information
No response