
Segmentation fault with cuda 11.3 #169

Closed
hzliangbin opened this issue Oct 27, 2022 · 13 comments
hzliangbin commented Oct 27, 2022

Environment Info

gpu-manager version: built on master
vcuda version: thomassong/vcuda:v1.0.5
nvidia driver: 470.57.02

Details

I ran a GPU pod with a base image that includes CUDA runtime version 11.3.1 and tested deviceQuery, but it ran into a segmentation fault, as shown below:
[screenshot: deviceQuery fails with a segmentation fault]

But deviceQuery from the 11.1 samples tested OK when I used a base image that includes CUDA 11.1.1.
[screenshot: deviceQuery from the 11.1 samples passes]

There is no vcuda log output even though LOGGER_LEVEL is set to 6; some help info follows:

[screenshot: vcuda help info]

And if I bypass libcuda-control.so,

mkdir /root/lib64; cd /root/lib64
cp /usr/local/nvidia/lib64/libcuda.so.470.57.02 ./
ln -sf libcuda.so.470.57.02 libcuda.so.1
ln -sf libcuda.so.1 libcuda.so
export LD_LIBRARY_PATH=/root/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server

then deviceQuery on CUDA 11.3.1 also manages to run. It seems something is wrong with vcuda.

May I have your help? Thanks a lot.

@hzliangbin

@mYmNeo @genedna could you have a look at this if convenient?


mYmNeo commented Oct 27, 2022

Enable core dumps, and use gdb on the core file to see which instruction caused the crash.


hzliangbin commented Nov 2, 2022

[screenshot: gdb session on the core dump]

reproduce:

use image: nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04

git clone https://github.com/NVIDIA/cuda-samples.git

cd cuda-samples; git checkout v11.3; cd Samples/deviceQuery/; make dbg=1

./deviceQuery

Program terminated with signal SIGSEGV, Segmentation fault.

@mYmNeo It seems to be a null-pointer error somewhere; I guess pfn did not get initialized, but I couldn't dig further into it.
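
To make the suspicion concrete, here is a minimal caller-side sketch (my own illustration, not vcuda code; build with something like gcc repro.c -lcuda). The cuGetProcAddress signature is the one declared in cuda.h for CUDA 11.3; if a hook intercepts the call but never assigns *pfn, the first call through the returned pointer crashes just like deviceQuery does:

#include <cuda.h>
#include <stdio.h>

int main(void) {
    cuInit(0);
    void *pfn = NULL;
    /* CUDA 11.3 declares:
     * CUresult cuGetProcAddress(const char *symbol, void **pfn,
     *                           int cudaVersion, cuuint64_t flags); */
    CUresult rc = cuGetProcAddress("cuDeviceGetCount", &pfn,
                                   11030, CU_GET_PROC_ADDRESS_DEFAULT);
    if (rc != CUDA_SUCCESS || pfn == NULL) {
        /* A broken hook would land here; a caller that skips this
         * check and calls through NULL dies with SIGSEGV. */
        fprintf(stderr, "pfn not assigned, rc = %d\n", (int)rc);
        return 1;
    }
    int count = 0;
    ((CUresult (*)(int *))pfn)(&count);  /* SIGSEGV if pfn is bogus */
    printf("device count: %d\n", count);
    return 0;
}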


hzliangbin commented Nov 2, 2022

It's similar to vcuda-controller/issues/20, but cuGetProcAddress was introduced in 11.3; refer to this.
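
For background, my understanding of what changed (a sketch, not verified against the libcudart source): from 11.3 on, the runtime resolves cuGetProcAddress once via dlsym and then pulls every other driver symbol through it, so a hook library that only exports wrapper symbols gets bypassed. Schematically:

#include <cuda.h>
#include <dlfcn.h>
#include <stdio.h>

typedef CUresult (*fn_gpa)(const char *, void **, int, cuuint64_t);

int main(void) {
    /* Roughly what libcudart >= 11.3 does at init time. */
    void *drv = dlopen("libcuda.so.1", RTLD_NOW);
    if (drv == NULL) return 1;
    /* The only dlsym lookup; everything else goes through gpa(). */
    fn_gpa gpa = (fn_gpa)dlsym(drv, "cuGetProcAddress");
    if (gpa == NULL) return 1;
    void *cu_init = NULL;
    gpa("cuInit", &cu_init, 11030, CU_GET_PROC_ADDRESS_DEFAULT);
    /* From here on an interposer is invisible unless it hooks
     * cuGetProcAddress itself. */
    printf("cuInit resolved at %p\n", cu_init);
    return 0;
}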


mYmNeo commented Nov 3, 2022

@hzliangbin Re-pull thomassong/vcuda:latest, rebuild the gpu-manager image, and see whether the problem is resolved.

@hzliangbin

@mYmNeo sorry to say that it makes no difference.


hzliangbin commented Nov 4, 2022

cuGetProcAddress requires pfn to be of type void **, but it seems the hook here receives the wrong type.

[screenshot of the code in question]
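
For reference, the typing involved (a sketch for contrast; I don't know the exact declaration the hook uses):

#include <cuda.h>

/* Correct hook prototype, matching cuda.h from CUDA 11.3: the driver's
 * function pointer is written back to the caller through pfn. */
CUresult hooked_cuGetProcAddress(const char *symbol, void **pfn,
                                 int cudaVersion, cuuint64_t flags);

/* A mismatched prototype still links, because the dynamic linker
 * matches symbols by name only, never by signature: */
CUresult broken_cuGetProcAddress(const char *symbol, void *pfn,
                                 int cudaVersion, cuuint64_t flags);

/* With the wrong type the caller's pointer is never assigned, and the
 * first call through it ends in SIGSEGV. */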


mYmNeo commented Nov 4, 2022

cuGetProcAddress requires pfn to be of type void **, but it seems the hook here receives the wrong type.

[screenshot of the code in question]

Did you get a successful result with CUDA 11.3 on a full card (not a shared GPU)?

@hzliangbin

@mYmNeo Yes, when I set vcore=100 to use a full card, it runs successfully with no segmentation fault.


mYmNeo commented Nov 4, 2022

@mYmNeo Yes, when I set vcore=100 to use a full card, it runs successfully with no segmentation fault.

I don't have any GPU resources, so it's hard to tell what caused pfn not to be assigned.

@hzliangbin

  1. YAML to deploy a GPU pod; set vcuda-core to "100" to use a full card:
apiVersion: v1
kind: Pod
metadata:
  name: vcuda-test2
  namespace: default
spec:
  nodeName: 10.68.129.30
  containers:
  - command:
    - sleep
    - 1d
    env:
    - name: PATH
      value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    - name: LOGGER_LEVEL
      value: "5"
    # image: ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2
    # use CUDA 11.3
    image: nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
    imagePullPolicy: IfNotPresent
    name: tensorflow-test
    resources:
      limits:
        cpu: "4"
        memory: 8Gi
        huya.com/vcuda-core: "100"
        huya.com/vcuda-memory: "32"
      requests:
        cpu: "4"
        memory: 8Gi
        huya.com/vcuda-core: "100"
        huya.com/vcuda-memory: "32"
  2. Go into the vcuda-test2 pod and install the git, gdb, and strace tools.
  3. Clone the NVIDIA cuda-samples repo and build deviceQuery for the test:
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples; git checkout v11.1; cd Samples/deviceQuery/; make dbg=1
  4. Run deviceQuery and check the output; the full-card test is OK.

[screenshot: deviceQuery passes on a full card]

@hzliangbin

@mYmNeo Hi, sorry to trouble you again. I found some related issues about this: How to hook CUDA runtime API in CUDA 11.4 and intercept-cuda-11.3-demo.

The first symbol is ‘cuGetProcAddress’. It seems to be a loop: cuGetProcAddress tries to get the address of itself. How to get the real address of cuGetProcAddress?

I guess that's the main cause. It seems more work is needed to be compatible with versions 11.3 and above.
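
To illustrate one way out of the loop (my own sketch based on the links above, not the actual vcuda-controller fix; the driver path is the one from my container, shown earlier):

#include <cuda.h>
#include <dlfcn.h>
#include <string.h>

typedef CUresult (*fn_gpa)(const char *, void **, int, cuuint64_t);

/* Exported by the hook library in place of the driver's symbol. */
CUresult cuGetProcAddress(const char *symbol, void **pfn,
                          int cudaVersion, cuuint64_t flags) {
    static fn_gpa real_gpa = NULL;
    if (real_gpa == NULL) {
        /* Resolve the real entry point from the driver file itself,
         * never from ourselves (we shadow the symbol name). */
        void *drv = dlopen("/usr/local/nvidia/lib64/libcuda.so.470.57.02",
                           RTLD_NOW | RTLD_LOCAL);
        if (drv == NULL)
            return CUDA_ERROR_NOT_INITIALIZED;
        real_gpa = (fn_gpa)dlsym(drv, "cuGetProcAddress");
        if (real_gpa == NULL)
            return CUDA_ERROR_NOT_FOUND;
    }
    if (strcmp(symbol, "cuGetProcAddress") == 0) {
        /* Break the self-lookup: hand back the hook itself so later
         * lookups keep flowing through the interposer. */
        *pfn = (void *)&cuGetProcAddress;
        return CUDA_SUCCESS;
    }
    /* Delegate everything else to the real driver; a full interposer
     * would substitute its own wrappers for the calls it throttles. */
    return real_gpa(symbol, pfn, cudaVersion, flags);
}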

Please leave the issue open; I will keep posting updates on my progress.


mYmNeo commented Nov 15, 2022

closed by tkestack/vcuda-controller#30
