Segmentation fault with cuda 11.3 #169
Open the coredump and use gdb to see which instruction caused the crash.
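A minimal sketch of that workflow (assuming the core dump lands in the working directory; the actual core file name depends on /proc/sys/kernel/core_pattern):

```sh
ulimit -c unlimited       # allow core dumps in this shell
./deviceQuery             # crashes with SIGSEGV and writes a core file
gdb ./deviceQuery core    # load the binary together with the dump
# inside gdb, `bt` prints the backtrace down to the faulting instruction
```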
Reproduce with image nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04:

```sh
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples; git checkout v11.3; cd Samples/deviceQuery/; make dbg=1
./deviceQuery
```

Program terminated with signal SIGSEGV, Segmentation fault.

@mYmNeo It looks like a null pointer dereference somewhere; I guess the pfn was not initialized, but I couldn't dig further into it.
It's similar to vcuda-controller/issues/20, but ...
@hzliangbin re-pull |
@mYmNeo sorry to say that it makes no difference. |
@mYmNeo Yes, when I set vcore=100 to use a full card, it runs successfully with no segmentation fault.
I don't have any GPU resources at hand, so it's hard to tell what caused the crash.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vcuda-test2
  namespace: default
spec:
  nodeName: 10.68.129.30
  containers:
  - command:
    - sleep
    - 1d
    env:
    - name: PATH
      value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    - name: LOGGER_LEVEL
      value: "5"
    #image: ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2
    # use cuda 11.3
    image: nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
    imagePullPolicy: IfNotPresent
    name: tensorflow-test
    resources:
      limits:
        cpu: "4"
        memory: 8Gi
        huya.com/vcuda-core: "100"
        huya.com/vcuda-memory: "32"
      requests:
        cpu: "4"
        memory: 8Gi
        huya.com/vcuda-core: "100"
        huya.com/vcuda-memory: "32"
```
```sh
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples; git checkout v11.1; cd Samples/deviceQuery/; make dbg=1
```
@mYmNeo Hi, sorry to trouble you again. I found some related issues about it: "How to hook CUDA runtime API in CUDA 11.4" and intercept-cuda-11.3-demo. The first symbol requested is 'cuGetProcAddress', and it seems to form a loop: cuGetProcAddress tries to get the address of itself. How do we get the real address of cuGetProcAddress? I guess that's the main cause. It seems more work needs to be done to be compatible with CUDA 11.3 and above. Please leave the issue open; I will keep posting my progress.
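For what it's worth, here is a minimal sketch in C of the idea behind breaking that loop: resolve the vendor driver directly with dlopen/dlsym, and return the wrapper itself whenever cuGetProcAddress is asked for. This is only an illustration under stated assumptions, not the actual fix that landed in tkestack/vcuda-controller#30; the typedefs are simplified stand-ins for the real CUDA headers, and "libcuda.so.1" is assumed to be the vendor driver's soname.

```c
/*
 * Hedged sketch only: NOT the actual fix from tkestack/vcuda-controller#30.
 * Shows one way an interposer library can break the
 * cuGetProcAddress -> cuGetProcAddress self-lookup loop.
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>

typedef int CUresult;                 /* stand-in: 0 == CUDA_SUCCESS */
typedef unsigned long long cuuint64_t;

/* The *real* cuGetProcAddress inside the vendor driver. */
static CUresult (*real_getproc)(const char *, void **, int, cuuint64_t);

static void resolve_real(void)
{
    if (real_getproc)
        return;
    /* Ask the dynamic linker for the vendor driver explicitly instead of
     * looking the symbol up through ourselves -- this breaks the loop. */
    void *h = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_GLOBAL);
    if (h)
        real_getproc = (CUresult (*)(const char *, void **, int, cuuint64_t))
            dlsym(h, "cuGetProcAddress");
}

/* Exported hook: since CUDA 11.3 the runtime resolves driver symbols
 * through this entry point, including cuGetProcAddress itself. */
CUresult cuGetProcAddress(const char *symbol, void **pfn,
                          int cudaVersion, cuuint64_t flags)
{
    resolve_real();
    if (!real_getproc || !pfn)
        return 1;                     /* stand-in for an error code */

    if (strcmp(symbol, "cuGetProcAddress") == 0) {
        /* Hand back this wrapper, not the driver's copy, so every later
         * lookup keeps flowing through the hook. */
        *pfn = (void *)cuGetProcAddress;
        return 0;
    }
    /* Everything else: delegate to the real driver (a full interposer
     * would wrap interesting symbols here before returning them). */
    return real_getproc(symbol, pfn, cudaVersion, flags);
}
```

Returning the wrapper for self-lookups is the crux: if the driver's real cuGetProcAddress leaks out even once, every subsequent symbol resolution bypasses the hook entirely.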
Closed by tkestack/vcuda-controller#30.
Environment Info
gpu-manager version: built on master
vcuda version: thomassong/vcuda:v1.0.5
nvidia driver: 470.57.02
Details
I ran a GPU pod whose base image includes CUDA runtime version 11.3.1 and tested deviceQuery,
![企业微信截图_63e1efbb-3dd1-4595-afa4-f3e1f9739d00](https://user-images.githubusercontent.com/9740458/198228613-b3cffd0f-46f9-472e-8f31-4eaab85e6c60.png)
but it ran into a segmentation fault, as shown above. In contrast, deviceQuery from the 11.1 samples ran fine when I used a base image with CUDA 11.1.1:
![企业微信截图_a4c23132-1ecc-4468-853c-1600abf7432c](https://user-images.githubusercontent.com/9740458/198229100-efc6a57a-fbba-46c3-aad7-e31f619a0504.png)
There is no vcuda log output even though `LOGGER_LEVEL` is set to 6. One more piece of information: if I skip libcuda-controll.so, deviceQuery also runs successfully on CUDA 11.3.1, so something seems wrong with vcuda. May I have your help? Thanks a lot.
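One way to confirm whether the hook library is actually loaded is to inspect the process's memory maps while the test runs (a hedged sketch; the hook library's file name and install path vary by deployment):

```sh
pid=$(pgrep -f deviceQuery)           # or the pod's main process
grep -i libcuda /proc/$pid/maps
# only the vendor libcuda.so.1 mapped        -> the hook was skipped
# a libcuda-control*.so mapped in addition   -> the hook is in place
```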