
Segmentation fault in SPMV benchmark on GPU target #1269

Closed · edopao opened this issue Jun 7, 2023 · 8 comments
edopao (Collaborator) commented Jun 7, 2023

Segmentation fault when running the spmv benchmark from npbench with the dace_gpu framework:

(.venv) epaone@nid01934:/scratch/snx3000/epaone/repo/npbench> python run_benchmark.py -b spmv -f dace_cpu
***** Testing DaCe CPU with spmv on the S dataset *****
NumPy - default - validation: 13ms
DaCe CPU - fusion - first/validation: 57ms
DaCe CPU - fusion - validation: SUCCESS
DaCe CPU - fusion - median: 55ms
DaCe CPU - parallel - first/validation: 2ms
DaCe CPU - parallel - validation: SUCCESS
DaCe CPU - parallel - median: 1ms
DaCe CPU - auto_opt - first/validation: 2ms
DaCe CPU - auto_opt - validation: SUCCESS
DaCe CPU - auto_opt - median: 1ms
(.venv) epaone@nid01934:/scratch/snx3000/epaone/repo/npbench> python run_benchmark.py -b gemm -f dace_gpu
***** Testing DaCe GPU with gemm on the S dataset *****
NumPy - default - validation: 111ms
DaCe GPU - fusion - first/validation: 423ms
DaCe GPU - fusion - validation: SUCCESS
DaCe GPU - fusion - median: 2ms
DaCe GPU - parallel - first/validation: 4ms
DaCe GPU - parallel - validation: SUCCESS
DaCe GPU - parallel - median: 2ms
DaCe GPU - auto_opt - first/validation: 4ms
DaCe GPU - auto_opt - validation: SUCCESS
DaCe GPU - auto_opt - median: 1ms
(.venv) epaone@nid01934:/scratch/snx3000/epaone/repo/npbench> python run_benchmark.py -b spmv -f dace_gpu
***** Testing DaCe GPU with spmv on the S dataset *****
NumPy - default - validation: 13ms
Segmentation fault (core dumped)
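For reference, the spmv kernel in npbench computes a CSR sparse matrix-vector product roughly as follows (a sketch from memory, not the verbatim npbench source):

```python
import numpy as np

def spmv(A_row, A_col, A_val, x):
    """CSR sparse matrix-vector product: y = A @ x."""
    y = np.empty(A_row.size - 1, dtype=A_val.dtype)
    for i in range(A_row.size - 1):
        # The row-pointer reads A_row[i] and A_row[i + 1] are the reads
        # that show up as __tmp3/__tmp4 in the generated code below.
        cols = A_col[A_row[i]:A_row[i + 1]]
        vals = A_val[A_row[i]:A_row[i + 1]]
        y[i] = vals @ x[cols]
    return y
```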

From the gdb backtrace it is possible to locate the segmentation fault at an access to the A_row array:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                                                                                                             
0x000015554fe9b27a in __program_fusion_internal (__state=0x5842880, A_col=0x1553e2a04200, A_row=0x1553e2a00000, A_val=0x1553e2a0c200, __return=0x1553e2a24200, x=0x1553e2a1c200, M=4096, N=4096, nnz=8192) at /scratch/snx3000/epaone/repo/npbench/.dacecache/fusion/src/cpu/fusion.cpp:25                                 
25              __tmp4 = A_row[(i + 1)];                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
(gdb) bt                                                                                                                                                                                                                                                                                                                   
#0  0x000015554fe9b27a in __program_fusion_internal (__state=0x5842880, A_col=0x1553e2a04200, A_row=0x1553e2a00000, A_val=0x1553e2a0c200, __return=0x1553e2a24200, x=0x1553e2a1c200, M=4096, N=4096, nnz=8192) at /scratch/snx3000/epaone/repo/npbench/.dacecache/fusion/src/cpu/fusion.cpp:25                             
#1  0x000015551e3466dd in ?? () from /usr/lib64/libffi.so.7
#2  0x000015551e345bdf in ?? () from /usr/lib64/libffi.so.7
#3  0x000015551e5a0a41 in _call_function_pointer (argtypecount=<optimized out>, argcount=9, resmem=0x7fffffff3b20, restype=<optimized out>, atypes=<optimized out>, avalues=<optimized out>, 
    pProc=0x15554fe9b570 <__program_fusion(fusion_t *, unsigned int * __restrict__, unsigned int * __restrict__, double * __restrict__, double * __restrict__, double * __restrict__, long long, long long, long long)>, flags=4353)
    at /home/users/mnusseibeh/rpmbuild/BUILD/cray-python-3.9.4.1-202108131723_038f7ca/Python-3.9.4/Modules/_ctypes/callproc.c:920
#4  _ctypes_callproc (pProc=pProc@entry=0x15554fe9b570 <__program_fusion(fusion_t *, unsigned int * __restrict__, unsigned int * __restrict__, double * __restrict__, double * __restrict__, double * __restrict__, long long, long long, long long)>, argtuple=argtuple@entry=0x1555501ecba0, flags=4353, 
    argtypes=argtypes@entry=0x155555477040, restype=0x84ff70, checker=0x0) at /home/users/mnusseibeh/rpmbuild/BUILD/cray-python-3.9.4.1-202108131723_038f7ca/Python-3.9.4/Modules/_ctypes/callproc.c:1263
#5  0x000015551e595699 in PyCFuncPtr_call (self=self@entry=0x1555501d51c0, inargs=inargs@entry=0x1555501ecba0, kwds=0x0) at /home/users/mnusseibeh/rpmbuild/BUILD/cray-python-3.9.4.1-202108131723_038f7ca/Python-3.9.4/Modules/_ctypes/_ctypes.c:4201
#6  0x0000155554eafcc1 in _PyObject_Call (tstate=0x605ef0, callable=0x1555501d51c0, args=0x1555501ecba0, kwargs=<optimized out>) at Objects/call.c:281
#7  0x0000155554f26be0 in do_call_core (kwdict=0x0, callargs=0x1555501ecba0, func=0x1555501d51c0, tstate=<optimized out>) at Python/ceval.c:5120
#8  _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3580

and to inspect some context:

(gdb) info locals
i = 0
__tmp3 = <optimized out>
__tmp4 = <optimized out>
(gdb) p A_row
$1 = (unsigned int * __restrict__) 0x1553e2a00000

The generated code looks like:

 13 void __program_fusion_internal(fusion_t *__state, unsigned int * __restrict__ A_col, unsigned int * __restrict__ A_row, double * __restrict__ A_val, double * __restrict__ __return, double * __restrict__ x, long long M, long long N, long long nnz)
 14 {
 15     long long i;
 16     unsigned int __tmp3;
 17     unsigned int __tmp4;
 18 
 19 
 20 
 21 
 22     for (i = 0; (i < M); i = (i + 1)) {
 23 
 24         __tmp3 = A_row[i];
 25         __tmp4 = A_row[(i + 1)];

The environment looks like:

(.venv) epaone@nid01934:/scratch/snx3000/epaone/repo/npbench> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0
(.venv) epaone@nid01934:/scratch/snx3000/epaone/repo/npbench> python --version
Python 3.9.4
(.venv) epaone@nid01934:/scratch/snx3000/epaone/repo/npbench> pip list
Package             Version  Editable project location
------------------- -------- ------------------------------------
aenum               3.1.12
astunparse          1.6.3
blinker             1.6.2
certifi             2023.5.7
chardet             5.1.0
charset-normalizer  3.1.0
click               8.1.3
commonmark          0.9.1
contourpy           1.0.7
cupy-cuda11x        12.1.0
cycler              0.11.0
dace                0.14.2
dill                0.3.6
exceptiongroup      1.1.1
fastrlock           0.8.1
Flask               2.3.2
fonttools           4.39.4
idna                3.4
importlib-metadata  6.6.0
importlib-resources 5.12.0
iniconfig           2.0.0
itsdangerous        2.1.2
Jinja2              3.1.2
kiwisolver          1.4.4
llvmlite            0.40.0
MarkupSafe          2.1.2
matplotlib          3.7.1
mpmath              1.3.0
networkx            3.1
npbench             0.1      /scratch/snx3000/epaone/repo/npbench
numba               0.57.0
numpy               1.24.3
packaging           23.1
pandas              2.0.1
Pillow              9.5.0
pip                 23.1.2
pluggy              1.0.0
ply                 3.11
Pygments            2.15.1
pygount             1.5.1
pyparsing           3.0.9
pytest              7.3.1
python-dateutil     2.8.2
pytz                2023.3
PyYAML              6.0
requests            2.30.0
rich                12.6.0
scipy               1.10.1
setuptools          49.2.1
six                 1.16.0
sympy               1.9
tomli               2.0.1
tzdata              2023.3
urllib3             2.0.2
websockets          11.0.3
Werkzeug            2.3.4
wheel               0.40.0
zipp                3.15.0
edopao (Collaborator) commented Jun 15, 2023

The problem seems to be caused by the fact that the array A_row has its storage set to StorageType.GPU_Global, so the generated host code attempts to dereference an address in device memory:

(.venv) epaone@nid05267:/scratch/snx3000/epaone/repo/npbench> LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH cuda-memcheck python run_benchmark.py -b spmv -f dace_gpu
========= CUDA-MEMCHECK
***** Testing DaCe GPU with spmv on the S dataset *****
NumPy - default - validation: 27ms
========= Error: process didn't terminate successfully                                                                                                                   
=========        The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or Nsight Eclipse Edition to catch host 
side errors.
========= No CUDA-MEMCHECK results found

It looks like all non-transient arrays get GPU_Global storage by default, no matter how they are used in the generated code:

        def copy_to_gpu(sdfg):
            for k, v in sdfg.arrays.items():
                if not v.transient and isinstance(v, dace.data.Array):
                    v.storage = dace.dtypes.StorageType.GPU_Global
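To see why this fails, consider what the generated host loop does: it dereferences A_row through a raw pointer. A minimal ctypes sketch of that access pattern (using a host allocation, where the read is valid; the device-pointer case is only described in the comment):

```python
import ctypes
import numpy as np

A_row = np.array([0, 2, 3, 7], dtype=np.uint32)

# Mimic the generated C code in __program_fusion_internal: read the
# array through a raw unsigned-int pointer.
ptr = A_row.ctypes.data_as(ctypes.POINTER(ctypes.c_uint32))
i = 0
tmp3 = ptr[i]      # A_row[i]      -> fine here: host allocation
tmp4 = ptr[i + 1]  # A_row[i + 1]  -> fine here: host allocation

# If A_row had instead been allocated with cudaMalloc (which is what
# StorageType.GPU_Global implies), these same dereferences would read a
# device address from the CPU and crash with SIGSEGV, as observed above.
print(tmp3, tmp4)
```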

edopao (Collaborator) commented Jun 19, 2023

I have skipped the fusion execution step of SPMV in npbench, and I can observe that the parallel and auto_opt implementations work fine.

(.venv) epaone@nid01933:/scratch/snx3000/epaone/repo/npbench> LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH python run_benchmark.py -b spmv -f dace_gpu
***** Testing DaCe GPU with spmv on the S dataset *****
NumPy - default - validation: 13ms
DaCe GPU - parallel - first/validation: 13ms
DaCe GPU - parallel - validation: SUCCESS
DaCe GPU - parallel - median: 2ms
DaCe GPU - auto_opt - first/validation: 4ms
DaCe GPU - auto_opt - validation: SUCCESS
DaCe GPU - auto_opt - median: 2ms

The reason is that the parallel optimisation converts the outer for loop into a map, so in the generated code A_row is accessed on the device. This issue is therefore limited to the fusion step and, more generally, to for loops whose inter-state symbols are defined from array values.
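A sketch of the distinction (hypothetical Python models of the two schedules, not DaCe code): in the sequential form the row-pointer reads feed loop-control state between SDFG states, i.e. host code, while in the map form each row is an independent computation whose A_row reads can live inside the device kernel body:

```python
import numpy as np

def spmv_loop(A_row, A_col, A_val, x):
    # Sequential schedule: A_row[i] / A_row[i + 1] are evaluated between
    # states as inter-state symbols -- i.e. in host code.
    y = np.empty(A_row.size - 1, dtype=A_val.dtype)
    for i in range(A_row.size - 1):
        start, stop = A_row[i], A_row[i + 1]
        y[i] = A_val[start:stop] @ x[A_col[start:stop]]
    return y

def spmv_map(A_row, A_col, A_val, x):
    # Map-style schedule: every row is independent, so the A_row reads
    # move inside the per-row computation and need never touch the host.
    def row(i):
        start, stop = A_row[i], A_row[i + 1]
        return A_val[start:stop] @ x[A_col[start:stop]]
    return np.array([row(i) for i in range(A_row.size - 1)])
```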

alexnick83 (Contributor) commented:
We are going to start working on this. #1291 has a quick fix for the specific benchmark if you want to test it.

edopao (Collaborator) commented Jul 4, 2023

I have tested it on GPU and it works. Thanks!

edopao (Collaborator) commented Jul 4, 2023

However, I see now that the autoopt transformation step fails:

***** Testing DaCe GPU with spmv on the S dataset *****
NumPy - default - validation: 26ms
DaCe autoopt failed
DaCe GPU - fusion - first/validation: 3870ms
DaCe GPU - fusion - validation: SUCCESS
DaCe GPU - fusion - median: 148ms
DaCe GPU - parallel - first/validation: 30ms
DaCe GPU - parallel - validation: SUCCESS
DaCe GPU - parallel - median: 20ms

Here is the full exception from the transformation:

Trying to read an inaccessible data container "A_row" (Storage: StorageType.GPU_Global) in host code interstate edge (at edge "__tmp3=A_row[i],__tmp4=A_row[(i + 1)]" (state -> slice_x_26)
Invalid SDFG saved for inspection in /scratch/e1000/epaone/repo/npbench/_dacegraphs/invalid.sdfg
Traceback (most recent call last):
  File "/scratch/e1000/epaone/repo/npbench/npbench/infrastructure/dace_framework.py", line 202, in implementations
    _, auto_time = util.benchmark(f"autoopt(auto_opt_sdfg, device, symbols = locals())",
  File "/scratch/e1000/epaone/repo/npbench/npbench/infrastructure/utilities.py", line 140, in benchmark
    output = timeit.repeat(stmt, setup=setup, repeat=repeat, number=1, globals=ldict)
  File "/user-environment/linux-sles15-zen2/gcc-11.3.0/python-3.10.10-aqwgvaqvbf5q4uzo2elz2cbc35xmfo6s/lib/python3.10/timeit.py", line 239, in repeat
    return Timer(stmt, setup, timer, globals).repeat(repeat, number)
  File "/user-environment/linux-sles15-zen2/gcc-11.3.0/python-3.10.10-aqwgvaqvbf5q4uzo2elz2cbc35xmfo6s/lib/python3.10/timeit.py", line 206, in repeat
    t = self.timeit(number)
  File "/user-environment/linux-sles15-zen2/gcc-11.3.0/python-3.10.10-aqwgvaqvbf5q4uzo2elz2cbc35xmfo6s/lib/python3.10/timeit.py", line 178, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "/scratch/e1000/epaone/repo/npbench/npbench/infrastructure/dace_framework.py", line 195, in autoopt
    opt.auto_optimize(auto_opt_sdfg, device, symbols=symbols)
  File "/scratch/e1000/epaone/repo/dace/dace/transformation/auto/auto_optimize.py", line 547, in auto_optimize
    sdfg.apply_transformations_repeated(TrivialMapElimination, validate=validate, validate_all=validate_all)
  File "/scratch/e1000/epaone/repo/dace/dace/sdfg/sdfg.py", line 2524, in apply_transformations_repeated
    results = pazz.apply_pass(self, {})
  File "/scratch/e1000/epaone/repo/dace/dace/transformation/passes/pattern_matching.py", line 253, in apply_pass
    return self._apply_pass(sdfg, pipeline_results, apply_once=False)
  File "/scratch/e1000/epaone/repo/dace/dace/transformation/passes/pattern_matching.py", line 241, in _apply_pass
    raise err
  File "/scratch/e1000/epaone/repo/dace/dace/transformation/passes/pattern_matching.py", line 235, in _apply_pass
    sdfg.validate()
  File "/scratch/e1000/epaone/repo/dace/dace/sdfg/sdfg.py", line 2354, in validate
    validate_sdfg(self, references, **context)
  File "/scratch/e1000/epaone/repo/dace/dace/sdfg/validation.py", line 174, in validate_sdfg
    raise InvalidSDFGInterstateEdgeError(
dace.sdfg.validation.InvalidSDFGInterstateEdgeError: Trying to read an inaccessible data container "A_row" (Storage: StorageType.GPU_Global) in host code interstate edge (at edge "__tmp3=A_row[i],__tmp4=A_row[(i + 1)]" (state -> slice_x_26)
Invalid SDFG saved for inspection in /scratch/e1000/epaone/repo/npbench/_dacegraphs/invalid.sdfg

alexnick83 (Contributor) commented:
This is a transformation-ordering issue in the workflow. We have introduced a workaround in the autoopt-device-validation-issues branch, but we will discuss how to better address the problem in the next meeting. See also #1323.

alexnick83 (Contributor) commented:
The issue is finally fixed by using the auto-optimizer's new use_gpu_storage flag. See also NPBench's #20.
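For context, a minimal usage sketch of that flag (untested here, as it assumes a DaCe installation with a CUDA-capable GPU; the module path matches the traceback above, and use_gpu_storage is the flag named in this comment):

```python
import dace
from dace.transformation.auto import auto_optimize as opt

@dace.program
def scale(a: dace.float64[100]):
    return a * 2.0

sdfg = scale.to_sdfg()
# Let the auto-optimizer move non-transient arrays to GPU storage itself,
# instead of forcing StorageType.GPU_Global on everything up front as the
# copy_to_gpu helper above did.
opt.auto_optimize(sdfg, dace.DeviceType.GPU, use_gpu_storage=True)
```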

edopao (Collaborator) commented Nov 16, 2023

Thank you. I have also tested it and it works.

edopao closed this as completed Nov 16, 2023