
Memory access fault - page not present or supervisor privilege, gfx1031 with HSA_OVERRIDE_GFX_VERSION=10.3.0 #3540

Open
sozforex opened this issue Feb 23, 2025 · 15 comments


sozforex commented Feb 23, 2025

On RX 6850M XT [gfx1031] with HSA_OVERRIDE_GFX_VERSION=10.3.0
Gentoo, HIP version 6.3.42134, MIOpen version 3.3.0

Met with the error by running:
https://github.com/HomebrewML/HeavyBall/blob/e8e44c2594230a59508d64830ed9af1732411f8f/examples/soap.py

Minimal reproduction:

MIOPEN_FIND_ENFORCE=3 HSA_OVERRIDE_GFX_VERSION=10.3.0 HIP_VISIBLE_DEVICES=0 MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1

Error:

MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
PRNG seed: 12345678
Memory access fault by GPU node-1 (Agent handle: 0x55724b0d00b0) on address 0x7fca1be00000. Reason: Page not present or supervisor privilege.
Failed to fetch queues snapshot.
GPU core dump failed
[1]    946669 IOT instruction (core dumped)  MIOPEN_FIND_ENFORCE=3 HSA_OVERRIDE_GFX_VERSION=10.3.0 HIP_VISIBLE_DEVICES=0

Full error log with debug env variables:
https://gist.githubusercontent.com/sozforex/6babbda6cacea2734e225e1a63ee7ae2/raw/c597b59d11062298b61474fb7c77f0b90764bb26/gfx1030_miopen_conv_error

Running the reproduction command with MIOPEN_DEBUG_CONV_GEMM=0 MIOPEN_FIND_ENFORCE=3 seems to save a different result in the MIOpen find database, which lets me work around the problem.

@ppanchad-amd

Hi @sozforex. Internal ticket has been created to investigate this issue. Thanks!


sozforex commented Feb 24, 2025

This issue is possibly related to #1431 - I see -mwavefrontsize64 in the debug logs, but the wavefront size should be 32 for my GPU [I will experiment with this and report the result].
EDIT: this GPU supports two different modes - WGP mode with -mno-wavefrontsize64 [wavefront size 32] and CU mode with -mwavefrontsize64 -mcumode.

Running the same soap.py after working around the previous error, I stumbled on a similar error.
The reproduction is similar to the previous one, but with -k 256 and -F 4:

MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1

Here is an error log:
https://gist.githubusercontent.com/sozforex/65e7fa023e1ba163ff6bdd81094989d8/raw/bfca7cc804d9a84868a732bff5bdcfcec6bbf0fc/gfx1030_miopen_conv_error2.txt

For this error, running the reproduction command with MIOPEN_DEBUG_CONV_DIRECT=0 MIOPEN_FIND_ENFORCE=3 allowed me to work around the problem.


averinevg commented Mar 4, 2025

Hi @sozforex. Your GPU is not a gfx1030; it is a gfx1031, and it is not on the list of supported devices [1][2]. Please correct the title. Both GPUs share the same ISA, but they have technical differences. HSA_OVERRIDE_GFX_VERSION allows one device to be detected as another, but it won't add more compute units or missing instructions, so it may work in some cases and fail in others.

BTW, the library has universal kernels that can, in theory, run on any hardware. Have you tried running it without HSA_OVERRIDE_GFX_VERSION?

Please also provide rocminfo output.

@sozforex sozforex changed the title Memory access fault - page not present or supervisor privilege, gfx1030. Memory access fault - page not present or supervisor privilege, gfx1031 with HSA_OVERRIDE_GFX_VERSION=10.3.0 Mar 6, 2025

sozforex commented Mar 6, 2025

Hi @averinevg, I'm aware that it is not on the list of supported devices - I don't have an AMD Radeon PRO W6800 or AMD Radeon PRO V620 to test whether this memory access fault reproduces on them.

I've tried running without HSA_OVERRIDE_GFX_VERSION [with full rocm compiled with both gfx1030 and gfx1031], I get the same errors.

rocminfo output

=====================
HSA System Attributes
=====================
Runtime Version:         1.1
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 2
*******
  Name:                    gfx1031
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon RX 6800M
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      3072(0xc00) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29663(0x73df)
  ASIC Revision:           0(0x0)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   2530
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            40
  SIMDs per CU:            2
  Shader Engines:          2
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 122
  SDMA engine uCode::      80
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    12566528(0xbfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    12566528(0xbfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1031
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

@averinevg

Hi @sozforex, since your hardware is not officially supported, the only approach in your case is "try it and disable whatever doesn't work." The logs show that in your case the GEMM and some direct algorithms are failing. To disable them, use the following environment variables:

MIOPEN_DEBUG_CONV_GEMM=0
MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0
MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW53=0

As I can see, you are already familiar with them, but instead of disabling all direct algorithms, you can disable only the ones that are failing.
Please try this and report back.
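
For anyone else narrowing this down by trial and error: a minimal shell sketch (my own, not from this thread) that reruns a reproduction command once per candidate variable, so the failing solver can be reduced to a single switch. The function name is made up; the variables are the ones listed above.

```shell
# Hedged sketch of the "disable and retry" approach: run the given
# command once per candidate MIOPEN_DEBUG_CONV_* switch, so the log of
# each run shows which single variable is enough to avoid the fault.
retry_with_each_disabled() {
  for var in MIOPEN_DEBUG_CONV_GEMM \
             MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2 \
             MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW53; do
    echo "=== retrying with $var=0 ==="
    # env sets only this one switch (plus find-enforce) for this run
    env "$var=0" MIOPEN_FIND_ENFORCE=3 "$@"
  done
}

# Usage, with the reproduction command from this thread:
# retry_with_each_disabled MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 \
#   -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
```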


LunNova commented Mar 6, 2025

Tested this on a W6800 on the rocm-6.3.3 tag of MIOpen and I can't reproduce it.

MIOPEN_FIND_ENFORCE=3 HIP_VISIBLE_DEVICES=0 ./MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
PRNG seed: 12345678
MIOpen(HIP): Warning [GetAllConfigs] ConvBinWinogradRxSf3x2: Searching the best solution among 60...
MIOpen(HIP): Warning [GenericSearch] Done: 60/0/60, best #44 3.93031 43
MIOpen(HIP): Warning [GenericSearch] ...Score: 1.20375 (default time 4.73111)
MIOpen(HIP): Warning [SearchImpl] Searching the best solution in the 9 dim space. Please, be patient...
MIOpen(HIP): Warning [SearchImpl] Runs left: 863, min time so far: 8.77174, curr time: 8.77174 16,16,16,16,1,1,1,1,1
MIOpen(HIP): Warning [SearchImpl] Runs left: 813, min time so far: 3.40147, curr time: 3.95063 8,8,16,16,2,2,1,1,2
MIOpen(HIP): Warning [SearchImpl] Runs left: 744, min time so far: 3.19795, curr time: 3.80195 8,32,16,32,2,1,2,1,1
MIOpen(HIP): Warning [SearchImpl] Runs left: 675, min time so far: 3.19795, curr time: 4.19203 32,8,32,16,1,2,4,1,1
MIOpen(HIP): Warning [SearchImpl] Runs left: 624, min time so far: 3.19795, curr time: 3.55043 8,16,32,16,4,1,4,1,2
MIOpen(HIP): Warning [SearchImpl] Runs left: 523, min time so far: 3.19795, curr time: 3.76435 16,16,32,32,2,2,8,1,2
MIOpen(HIP): Warning [SearchImpl] Runs left: 468, min time so far: 3.19795, curr time: 3.87331 8,8,32,32,4,4,2,1,1
MIOpen(HIP): Warning [SearchImpl] Default run, min time so far: 3.19795, default time: 3.80835 16,16,32,32,2,2,8,2,1
MIOpen(HIP): Warning [SearchImpl] ...Score: 1.19087
MIOpen Backward Data Conv. Algorithm: 3, Solution: 84/ConvBinWinogradRxSf2x3g1
GPU Kernel Time Backward Data Conv. Elapsed: 3.051616 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdd-conv5x5u1, 1024, 256, 5, 5, 1, 32, 32,  13421772800, 1074176, 4194304, 4398, 2, 3.051616
Backward Convolution Data Verifies OK on GPU reference (1.94073e-08 < 1.5e-06)

Have you tried building MIOpen for gfx1031 specifically instead of using the arch override?


sozforex commented Mar 6, 2025

@averinevg, thank you. When I looked for env variables to disable a smaller subset of algorithms, I tried only some of those listed in https://github.com/ROCm/MIOpen/blob/develop/docs/how-to/debug-log.rst [and, not understanding them well, missed the last two you listed].

MIOPEN_DEBUG_CONV_GEMM=0
helps with
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1

MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0
helps with
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1

The above two env variables are sufficient to avoid the memory access fault errors when running soap.py on my GPU, thanks.

Have not stumbled yet on a case where MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW53=0 may be needed.

Have you tried building MIOpen for gfx1031 specifically instead of using the arch override?

Hi @LunNova, thanks for testing this on an actual gfx1030.
I've tried with and without the override, with full rocm [including MIOpen] compiled for both gfx1030 and gfx1031 at the same time [either rocm-6.3.1 or rocm-6.3.2],
and separately with the override, with rocm-6.3.3 compiled for gfx1030.

Oh, not quite full rocm - I'm using llvm/clang-19.1.7 [on Gentoo] instead of AOCC or the version of llvm that ships with official rocm releases.


sozforex commented Mar 7, 2025

Just in case, I tried this again with rocm-6.3.3 [including rocBLAS, Tensile and MIOpen] compiled only for gfx1031 [without gfx1030].

MIOPEN_FIND_ENFORCE=3 HIP_VISIBLE_DEVICES=0 MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
without MIOPEN_DEBUG_CONV_GEMM=0
results in a rocBLAS/Tensile error:

...
MIOpen(HIP): Info2 [SQLiteBase] Initializing user database file "/home/user/.cache/miopen/3.3.0./gfx1031_20.ukdb"
MIOpen(HIP): Info2 [KernDb] Database created successfully
MIOpen(HIP): Info2 [LoadBinary] Loading binary for: "Conv_Winograd_v30_3_1_fp32_f2x3_stride1.s.o"; args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -Wa,-defsym,FORCE_CACHE_BYPASS_ON_STORE=0 -mcumode -mwavefrontsize64 -mcpu=gfx1031
MIOpen(HIP): Info2 [Prepare] SELECT kernel_blob, kernel_hash, uncompressed_size FROM kern_db WHERE (kernel_name = 'Conv_Winograd_v30_3_1_fp32_f2x3_stride1.s.o') AND (kernel_args = '-Wa,-defsym,ROCM_METADATA_VERSION=5 -Wa,-defsym,FORCE_CACHE_BYPASS_ON_STORE=0 -mcumode -mwavefrontsize64 -mcpu=gfx1031');
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.738709 ms
MIOpen(HIP): Info2 [LoadBinary] Successfully loaded binary for: "Conv_Winograd_v30_3_1_fp32_f2x3_stride1.s.o"; args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -Wa,-defsym,FORCE_CACHE_BYPASS_ON_STORE=0 -mcumode -mwavefrontsize64 -mcpu=gfx1031
MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: MIOpenConvUni
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info [EvaluateInvokers] ConvOclDirectFwd: MIOpenConvUni: 3.17666 < 3.40282e+38
MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: naive_conv_ab_nonpacked_bwd_nchw_float_double_float
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info [EvaluateInvokers] ConvDirectNaiveConvBwd: naive_conv_ab_nonpacked_bwd_nchw_float_double_float: 131.582 >= 3.17666
MIOpen(HIP): Info2 [Register] Invoker registered for algorithm 1x32x32x5x5x256x32x32x1024xNCHWxFP32x2x2x1x1x1x1x1xBxDefault and solver ConvOclDirectFwd
MIOpen(HIP): Info2 [SetAsFound1_0] Solver ConvOclDirectFwd registered as find 1.0 best for miopenConvolutionBwdDataAlgoDirect in 1x32x32x5x5x256x32x32x1024xNCHWxFP32x2x2x1x1x1x1x1xBxDefault
MIOpen(HIP): Info [EvaluateInvokers] Selected: ConvOclDirectFwd: MIOpenConvUni: 3.17666, workspace_sz = 0
MIOpen(HIP): auto miopen::solver::conv::GemmBwdRest::GetSolution(const ExecutionContext &, const ProblemDescription &)::(anonymous class)::operator()(const std::vector<Kernel> &)::(anonymous class)::operator()(const Handle &, const AnyInvokeParams &) const{
MIOpen(HIP):    "convolution, non 1x1" = convolution, non 1x1
MIOpen(HIP): }
MIOpen(HIP): Info2 [CallGemm] gemm_desc: {isColMajor 0, transA 1, transB 0, m 6400, n 1024, k 1, lda 6400, ldb 1024, ldc 1024, batch_count 1, strideA 0, strideB 0, strideC 0, alpha 1, beta 0, dataType float, a_cast_type float, b_cast_type float} 
MIOpen(HIP): Info2 [CallGemm] rocBLAS

rocBLAS error: Cannot read /usr/lib64/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1030
 List of available TensileLibrary Files : 
"/usr/lib64/rocblas/library/TensileLibrary_lazy_gfx1031.dat"
[1]    2159995 IOT instruction (core dumped)  MIOPEN_FIND_ENFORCE=3 MIOPEN_LOG_LEVEL=6 MIOPEN_ENABLE_LOGGING=1 =1 =0  conv

The second command,
MIOPEN_FIND_ENFORCE=3 HIP_VISIBLE_DEVICES=0 MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
without MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 results in the same memory access fault as described previously.

Running it with MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 results in a rocBLAS/Tensile error of the same kind as above.

rocBLAS error: Cannot read /usr/lib64/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1030
 List of available TensileLibrary Files : 
"/usr/lib64/rocblas/library/TensileLibrary_lazy_gfx1031.dat"
[1]    2160815 IOT instruction (core dumped)  MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 MIOPEN_FIND_ENFORCE=3 MIOPEN_LOG_LEVEL=5

With both MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 and MIOPEN_DEBUG_CONV_GEMM=0 it runs to completion without errors.
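
Since this pair of variables is what ended up working here, a small wrapper-script sketch (the script itself is mine, not part of this thread) that exports both and then runs whatever command is passed to it:

```shell
#!/bin/sh
# Hedged sketch: export the two workarounds that were sufficient on this
# gfx1031, then exec the given command so it inherits the environment.
export MIOPEN_DEBUG_CONV_GEMM=0
export MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0
exec "$@"
```

Usage would be e.g. `./miopen-workaround.sh python soap.py` (the wrapper filename is a placeholder).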

@averinevg

Tested this on a W6800 on the rocm-6.3.3 tag of MIOpen and I can't reproduce it.

Hi @LunNova, could you please check again with MIOPEN_DEBUG_CONV_WINOGRAD=0?

@LunNova
Copy link

LunNova commented Mar 7, 2025

$ MIOPEN_DEBUG_CONV_WINOGRAD=0 MIOPEN_FIND_ENFORCE=3 ./MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
PRNG seed: 12345678
MIOpen Backward Data Conv. Algorithm: 1, Solution: 11/ConvOclDirectFwd
GPU Kernel Time Backward Data Conv. Elapsed: 3.381792 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdd-conv5x5u1, 1024, 256, 5, 5, 1, 32, 32,  13421772800, 1074176, 4194304, 3969, 2, 3.381792
Backward Convolution Data Verifies OK on GPU reference (1.74561e-08 < 1.5e-06)

@averinevg

@LunNova Thank you. Could you also please check MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 with the same env variable?

@averinevg

Just in case, tried this again with rocm-6.3.3 [including rocBLAS, Tensile and MIOpen] compiled only with gfx1031 [without gfx1030].

...
rocBLAS error: Cannot read /usr/lib64/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1030
List of available TensileLibrary Files :
"/usr/lib64/rocblas/library/TensileLibrary_lazy_gfx1031.dat"
...

Hi @sozforex, thank you for your research. This error comes from the depths of rocBLAS, which MIOpen uses for some operations. I see that it is built for gfx1031, but for some reason it detects your GPU as a gfx1030. Did you use HSA_OVERRIDE_GFX_VERSION?


sozforex commented Mar 7, 2025

Did you use HSA_OVERRIDE_GFX_VERSION?

Hi @averinevg, I ran it without HSA_OVERRIDE_GFX_VERSION [I checked that it is unset].

You can see -mcpu=gfx1031 in this part of the log above:

...
MIOpen(HIP): Info2 [LoadBinary] Loading binary for: "Conv_Winograd_v30_3_1_fp32_f2x3_stride1.s.o"; args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -Wa,-defsym,FORCE_CACHE_BYPASS_ON_STORE=0 -mcumode -mwavefrontsize64 -mcpu=gfx1031
MIOpen(HIP): Info2 [Prepare] SELECT kernel_blob, kernel_hash, uncompressed_size FROM kern_db WHERE (kernel_name = 'Conv_Winograd_v30_3_1_fp32_f2x3_stride1.s.o') AND (kernel_args = '-Wa,-defsym,ROCM_METADATA_VERSION=5 -Wa,-defsym,FORCE_CACHE_BYPASS_ON_STORE=0 -mcumode -mwavefrontsize64 -mcpu=gfx1031');
...

I remember now that Gentoo patches rocBLAS and Tensile to extend compatibility:
https://github.com/gentoo/gentoo/blob/master/sci-libs/rocBLAS/files/rocBLAS-6.0.2-expand-isa-compatibility.patch
https://github.com/gentoo/gentoo/blob/master/dev-util/Tensile/files/Tensile-6.0.2-expand-isa-compatibility.patch

These compatibility-extending patches may not work as intended when those packages are compiled for gfx1031 but not gfx1030 [I'm not really sure].


LunNova commented Mar 8, 2025

MIOPEN_DEBUG_CONV_WINOGRAD=0 MIOPEN_FIND_ENFORCE=3 ./MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
PRNG seed: 12345678
MIOpen(HIP): Warning [GetAllConfigs] ConvOclBwdWrW2<2>: Searching the best solution among 378...
MIOpen(HIP): Warning [Monitor] 0/0/378 2407, best within recent 1: 2407 #0 1,7,1,1,11, ETA:0 sec.
<snipped lots of similar lines>
MIOpen(HIP): Warning [GenericSearch] Done: 36/0/36, best #7 460.478 64,64,64,16,4,4
MIOpen(HIP): Warning [GenericSearch] ...Score: 1.0087 (default time 464.483)
MIOpen Backward Weights Conv. Algorithm: 1, Solution: 24/ConvOclBwdWrW53
GPU Kernel Time Backward Weights Conv. Elapsed: 272.043365 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv5x5u1, 1024, 256, 32, 32, 5, 5, 256,  3435973836800, 0, 0, 12630, 0, 272.043365
Backward Convolution Weights FAILED: 3.64114e-06 > 3e-06


sozforex commented Mar 8, 2025

I had not noticed it before [since it is not an exception], but I get the same Backward Convolution Weights FAILED: 3.64114e-06 > 3e-06 result when running with or without the override [this time with rocm compiled for both gfx1030 and gfx1031] and with MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0.

MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 MIOPEN_FIND_ENFORCE=3 HIP_VISIBLE_DEVICES=0 MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
PRNG seed: 12345678
MIOpen Backward Weights Conv. Algorithm: 1, Solution: 24/ConvOclBwdWrW53
GPU Kernel Time Backward Weights Conv. Elapsed: 416.347260 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv5x5u1, 1024, 256, 32, 32, 5, 5, 256,  3435973836800, 0, 0, 8253, 0, 416.347260
Backward Convolution Weights FAILED: 3.64114e-06 > 3e-06

With MIOPEN_DEBUG_CONV_WINOGRAD=0 instead of MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 I get a memory access fault.
