Memory access fault - page not present or supervisor privilege, gfx1031 with HSA_OVERRIDE_GFX_VERSION=10.3.0 #3540
Comments
Hi @sozforex. Internal ticket has been created to investigate this issue. Thanks! |
Running the same
Here is an error log: For this error, running the reproduction command with |
Hi @sozforex. Your gpu is not a gfx1030, it is a gfx1031. Also it is not on the list of supported devices [1][2]. Please correct the title. Both gpus share the same ISA, but they have technical differences. BTW the library has universal kernels that theoretically can run on any hardware. Have you tried running it without Please also provide rocminfo output. |
Hi @averinevg, I'm aware that it is not on the list of supported devices - I do not have AMD Radeon PRO W6800 or AMD Radeon PRO V620 to test if this memory access fault can be reproduced on them. I've tried running without HSA_OVERRIDE_GFX_VERSION [with full rocm compiled with both gfx1030 and gfx1031], I get the same errors.
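For anyone repeating this check, a minimal sketch of confirming whether the override is active and of building an environment without it. The helper names `override_status` and `env_without_override` are hypothetical, not part of ROCm or MIOpen; only the variable name HSA_OVERRIDE_GFX_VERSION comes from this thread.

```python
import os

OVERRIDE_VAR = "HSA_OVERRIDE_GFX_VERSION"  # variable name taken from this thread


def override_status(env=None):
    """Describe whether the arch override is set (hypothetical helper)."""
    env = os.environ if env is None else env
    value = env.get(OVERRIDE_VAR)
    if value is None:
        return f"{OVERRIDE_VAR} is unset"
    return f"{OVERRIDE_VAR}={value}"


def env_without_override(base=None):
    """Return a copy of the environment with the override removed."""
    env = dict(os.environ if base is None else base)
    env.pop(OVERRIDE_VAR, None)  # no error if it was never set
    return env


if __name__ == "__main__":
    print(override_status())
```

The dictionary returned by `env_without_override()` can be passed as `env=` to `subprocess.run` so the reproduction runs in a child process that never sees the override.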
Hi @sozforex, Since your hardware is not officially supported, the only solution in your case is the approach "try and disable everything that doesn't work." The logs show that in your case, the GEMM and some direct algorithms are not working. To disable them, you need to use the following environment variables: MIOPEN_DEBUG_CONV_GEMM=0 As I see, you are already familiar with them, but instead of disabling all direct algorithms, you can disable only those that are failing. |
Tested this on a W6800 on the rocm-6.3.3 tag of MIOpen and I can't reproduce it.
Have you tried building MIOpen for gfx1031 specifically instead of using the arch override? |
@averinevg, thank you. When I looked for env variables to disable a smaller subset of algorithms, I tried only some of those listed in https://github.com/ROCm/MIOpen/blob/develop/docs/how-to/debug-log.rst [and, lacking understanding, missed the last two you listed].
MIOPEN_DEBUG_CONV_GEMM=0 MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0
The above two env variables are sufficient to avoid memory access fault errors when running soap.py on my GPU, thanks. I have not yet stumbled on a case where MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW53=0 may be needed.
Hi @LunNova , thanks for testing this on an actual gfx1030. Oh, not full rocm - I'm using llvm/clang-19.1.7 [on Gentoo] instead of AOCC or the version of llvm that comes with official rocm releases. |
Just in case, tried this again with rocm-6.3.3 [including rocBLAS, Tensile and MIOpen] compiled only with gfx1031 [without gfx1030].
The second command Running it with MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0, results in a rocBLAS/Tensile error of the same kind as above.
With both MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 and MIOPEN_DEBUG_CONV_GEMM=0 it runs to completion without errors. |
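The workaround reported here can be sketched as a small launcher that sets both variables for a child process only. This is an illustration under assumptions: `workaround_env` and `run_with_workaround` are hypothetical helper names, and the two variables are the ones reported in this thread as sufficient on one unsupported gfx1031 setup, not a general fix.

```python
import os
import subprocess
import sys

# Variables reported in this thread as sufficient to avoid the fault
# on the reporter's gfx1031 setup.
MIOPEN_WORKAROUND = {
    "MIOPEN_DEBUG_CONV_GEMM": "0",
    "MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2": "0",
}


def workaround_env(base=None, extra=None):
    """Return a copy of the environment with the workaround variables set."""
    env = dict(os.environ if base is None else base)
    env.update(MIOPEN_WORKAROUND)
    if extra:
        env.update(extra)  # e.g. further MIOPEN_DEBUG_* switches, if needed
    return env


def run_with_workaround(argv, extra=None):
    """Run argv in a child process that sees the workaround variables."""
    return subprocess.run(argv, env=workaround_env(extra=extra), check=False)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python launcher.py python soap.py
    sys.exit(run_with_workaround(sys.argv[1:]).returncode)
```

Setting the variables in the child's environment, rather than with `os.environ` inside an already running process, avoids any question of whether the library has already read its configuration.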
Hi @LunNova, could you please check again with |
@LunNova Thank you. Could you also please check |
Hi @sozforex, thank you for your research. This error comes from the depths of the |
Hi @averinevg, I've run it without the HSA_OVERRIDE_GFX_VERSION [I've checked that it is unset]. You can see
I remember now that Gentoo patches rocBLAS and Tensile to extend compatibility: These compatibility extending patches may not work as intended when those packages are compiled with gfx1031 but without gfx1030. [I'm not really sure] |
I have not noticed it previously [as it is not an exception], but I get the same
With |
On RX 6850M XT [gfx1031] with HSA_OVERRIDE_GFX_VERSION=10.3.0
Gentoo, HIP version 6.3.42134, MIOpen version 3.3.0
Encountered the error by running:
https://github.com/HomebrewML/HeavyBall/blob/e8e44c2594230a59508d64830ed9af1732411f8f/examples/soap.py
Minimal reproduction:
Error:
Full error log with debug env variables:
https://gist.githubusercontent.com/sozforex/6babbda6cacea2734e225e1a63ee7ae2/raw/c597b59d11062298b61474fb7c77f0b90764bb26/gfx1030_miopen_conv_error
Running the reproduction command with
MIOPEN_DEBUG_CONV_GEMM=0 MIOPEN_FIND_ENFORCE=3
I think this saves a different result in the MIOpen find database and allows one to get around the problem.