
Segmentation fault with cuda 11.3 #169

Closed
hzliangbin opened this issue Oct 27, 2022 · 13 comments
hzliangbin commented Oct 27, 2022

Environment Info

gpu-manager version: built on master
vcuda version: thomassong/vcuda:v1.0.5
nvidia driver: 470.57.02

Details

I ran a GPU pod with a base image that includes CUDA runtime version 11.3.1 and tested deviceQuery, but it ran into a segmentation fault, as shown below:
[screenshot: deviceQuery fails with a segmentation fault]

But deviceQuery from the 11.1 samples tested OK when I used a base image that includes CUDA 11.1.1.
[screenshot: deviceQuery from the 11.1 samples passes]

There is no vcuda log output even though LOGGER_LEVEL is set to 6; some help info follows:

[screenshot: vcuda help info]

And if I bypass libcuda-control.so,

mkdir /root/lib64; cd /root/lib64
cp /usr/local/nvidia/lib64/libcuda.so.470.57.02 ./
ln -sf libcuda.so.470.57.02 libcuda.so.1
ln -sf libcuda.so.1 libcuda.so
export LD_LIBRARY_PATH=/root/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server

then deviceQuery on CUDA 11.3.1 also manages to run. It seems something is wrong with vcuda.

May I have your help? Thanks a lot.

@hzliangbin

@mYmNeo @genedna could you have a look at this if convenient?


mYmNeo commented Oct 27, 2022

Enable core dumps, and use gdb on the core file to see which instruction caused the crash.


hzliangbin commented Nov 2, 2022

[screenshot: gdb session on the core dump]

reproduce:

use image: nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04

git clone https://github.com/NVIDIA/cuda-samples.git

cd cuda-samples; git checkout v11.3; cd Samples/deviceQuery/; make dbg=1

./deviceQuery

Program terminated with signal SIGSEGV, Segmentation fault.

@mYmNeo It seems to be a null-pointer error somewhere; I guess pfn did not get initialized, but I couldn't dig further into it.
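
To make the suspicion concrete, here is a minimal caller-side sketch (my own illustration, not vcuda code; build with something like gcc repro.c -lcuda). The cuGetProcAddress signature is the one declared in cuda.h for CUDA 11.3; if a hook intercepts the call but never assigns *pfn, the first call through the returned pointer crashes just like deviceQuery does:

#include <cuda.h>
#include <stdio.h>

int main(void) {
    cuInit(0);
    void *pfn = NULL;
    /* CUDA 11.3 declares:
     * CUresult cuGetProcAddress(const char *symbol, void **pfn,
     *                           int cudaVersion, cuuint64_t flags); */
    CUresult rc = cuGetProcAddress("cuDeviceGetCount", &pfn,
                                   11030, CU_GET_PROC_ADDRESS_DEFAULT);
    if (rc != CUDA_SUCCESS || pfn == NULL) {
        /* A broken hook would land here; a caller that skips this
         * check and calls through NULL dies with SIGSEGV. */
        fprintf(stderr, "pfn not assigned, rc = %d\n", (int)rc);
        return 1;
    }
    int count = 0;
    ((CUresult (*)(int *))pfn)(&count);  /* SIGSEGV if pfn is bogus */
    printf("device count: %d\n", count);
    return 0;
}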


hzliangbin commented Nov 2, 2022

It's similar to vcuda-controller/issues/20, but cuGetProcAddress was introduced in 11.3; refer to this.
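
For background, my understanding of what changed (a sketch, not verified against the libcudart source): from 11.3 on, the runtime resolves cuGetProcAddress once via dlsym and then pulls every other driver symbol through it, so a hook library that only exports wrapper symbols gets bypassed. Schematically:

#include <cuda.h>
#include <dlfcn.h>
#include <stdio.h>

typedef CUresult (*fn_gpa)(const char *, void **, int, cuuint64_t);

int main(void) {
    /* Roughly what libcudart >= 11.3 does at init time. */
    void *drv = dlopen("libcuda.so.1", RTLD_NOW);
    if (drv == NULL) return 1;
    /* The only dlsym lookup; everything else goes through gpa(). */
    fn_gpa gpa = (fn_gpa)dlsym(drv, "cuGetProcAddress");
    if (gpa == NULL) return 1;
    void *cu_init = NULL;
    gpa("cuInit", &cu_init, 11030, CU_GET_PROC_ADDRESS_DEFAULT);
    /* From here on an interposer is invisible unless it hooks
     * cuGetProcAddress itself. */
    printf("cuInit resolved at %p\n", cu_init);
    return 0;
}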


mYmNeo commented Nov 3, 2022

@hzliangbin Re-pull thomassong/vcuda:latest, rebuild the gpu-manager image, and see whether the problem is resolved.

@hzliangbin

@mYmNeo sorry to say that it makes no difference.


hzliangbin commented Nov 4, 2022

cuGetProcAddress requires pfn to be of type void **, but it seems the hook here receives the wrong type.

[screenshot of the code in question]
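
For reference, the typing involved (a sketch for contrast; I don't know the exact declaration the hook uses):

#include <cuda.h>

/* Correct hook prototype, matching cuda.h from CUDA 11.3: the driver's
 * function pointer is written back to the caller through pfn. */
CUresult hooked_cuGetProcAddress(const char *symbol, void **pfn,
                                 int cudaVersion, cuuint64_t flags);

/* A mismatched prototype still links, because the dynamic linker
 * matches symbols by name only, never by signature: */
CUresult broken_cuGetProcAddress(const char *symbol, void *pfn,
                                 int cudaVersion, cuuint64_t flags);

/* With the wrong type the caller's pointer is never assigned, and the
 * first call through it ends in SIGSEGV. */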


mYmNeo commented Nov 4, 2022

cuGetProcAddress requires pfn to be of type void **, but it seems the hook here receives the wrong type.

[screenshot of the code in question]

Did you get a successful result with CUDA 11.3 on a full card (not a shared GPU)?

@hzliangbin

@mYmNeo Yes, when I set vcore=100 to use a full card, it runs successfully with no segmentation fault.


mYmNeo commented Nov 4, 2022

@mYmNeo Yes, when I set vcore=100 to use a full card, it runs successfully with no segmentation fault.

I don't have any GPU resources, so it's hard to tell what caused pfn not to be assigned.

@hzliangbin

  1. YAML to deploy a GPU pod; set vcuda-core to "100" to use a full card:
apiVersion: v1
kind: Pod
metadata:
  name: vcuda-test2
  namespace: default
spec:
  nodeName: 10.68.129.30
  containers:
  - command:
    - sleep
    - 1d
    env:
    - name: PATH
      value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    - name: LOGGER_LEVEL
      value: "5"
    # image: ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2
    # use CUDA 11.3
    image: nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
    imagePullPolicy: IfNotPresent
    name: tensorflow-test
    resources:
      limits:
        cpu: "4"
        memory: 8Gi
        huya.com/vcuda-core: "100"
        huya.com/vcuda-memory: "32"
      requests:
        cpu: "4"
        memory: 8Gi
        huya.com/vcuda-core: "100"
        huya.com/vcuda-memory: "32"
  2. Go into the vcuda-test2 pod and install the git, gdb, and strace tools.
  3. Clone the NVIDIA cuda-samples repo and build deviceQuery for the test:
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples; git checkout v11.1; cd Samples/deviceQuery/; make dbg=1
  4. Run deviceQuery and check the output; the full-card test is OK.

[screenshot: deviceQuery passes on a full card]

@hzliangbin

@mYmNeo Hi, sorry to trouble you again. I found some related issues about this: How to hook CUDA runtime API in CUDA 11.4 and intercept-cuda-11.3-demo.

The first symbol is ‘cuGetProcAddress’. It seems to be a loop: cuGetProcAddress tries to get the address of itself. How to get the real address of cuGetProcAddress?

I guess that's the main cause. It seems more work is needed to be compatible with versions 11.3 and above.
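
To illustrate one way out of the loop (my own sketch based on the links above, not the actual vcuda-controller fix; the driver path is the one from my container, shown earlier):

#include <cuda.h>
#include <dlfcn.h>
#include <string.h>

typedef CUresult (*fn_gpa)(const char *, void **, int, cuuint64_t);

/* Exported by the hook library in place of the driver's symbol. */
CUresult cuGetProcAddress(const char *symbol, void **pfn,
                          int cudaVersion, cuuint64_t flags) {
    static fn_gpa real_gpa = NULL;
    if (real_gpa == NULL) {
        /* Resolve the real entry point from the driver file itself,
         * never from ourselves (we shadow the symbol name). */
        void *drv = dlopen("/usr/local/nvidia/lib64/libcuda.so.470.57.02",
                           RTLD_NOW | RTLD_LOCAL);
        if (drv == NULL)
            return CUDA_ERROR_NOT_INITIALIZED;
        real_gpa = (fn_gpa)dlsym(drv, "cuGetProcAddress");
        if (real_gpa == NULL)
            return CUDA_ERROR_NOT_FOUND;
    }
    if (strcmp(symbol, "cuGetProcAddress") == 0) {
        /* Break the self-lookup: hand back the hook itself so later
         * lookups keep flowing through the interposer. */
        *pfn = (void *)&cuGetProcAddress;
        return CUDA_SUCCESS;
    }
    /* Delegate everything else to the real driver; a full interposer
     * would substitute its own wrappers for the calls it throttles. */
    return real_gpa(symbol, pfn, cudaVersion, flags);
}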

Please leave the issue open; I will keep posting updates on my progress.


mYmNeo commented Nov 15, 2022

closed by tkestack/vcuda-controller#30
