
Having some problem while testing #13

benesse1899 opened this issue Apr 26, 2021 · 10 comments
@benesse1899

I've followed the steps in https://asciinema.org/a/302094, but after I create the SharePod and run kubectl get sharepod pod1 -o yaml | egrep -m2 'GPUID|nodeName', it doesn't show anything. Why?

I don't know whether it is because my CUDA version is 11.2 or something else.

BTW, do I need to run the Makefile? I'm not sure whether that is necessary.

@StarCoral
Contributor

StarCoral commented Apr 27, 2021

Hi,

Please make sure that the following settings are completed (a quick verification sketch follows the list):

  1. CUDA works
  • nvcc
  • nvidia-smi
  2. Docker configured with nvidia as the default runtime
    Runtimes: nvidia runc
    Default Runtime: nvidia
  3. nvidia device plugin is running
    see more: https://github.com/NVIDIA/k8s-device-plugin
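
A rough verification sketch for the checklist above (plain shell; <gpu-node-name> is a placeholder for your GPU worker node):

nvcc -V                                                      # CUDA toolkit is installed
nvidia-smi                                                   # driver can see the GPU
docker info | grep -i runtime                                # expect "Default Runtime: nvidia"
kubectl get pods -n kube-system | grep nvidia-device-plugin  # device plugin pods are Running
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu  # GPU is advertised to Kubernetes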

After completing the above work, you can install KubeShare.
Please also confirm that all of KubeShare's Pods are running (a quick check follows the list):

  • kubeshare-scheduler
  • kubeshare-device-manager
  • kubeshare-node-daemon
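
For example (a minimal check):

kubectl get pods -n kube-system | grep kubeshare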

BTW, I don't think it is caused by CUDA 11.2, because I tried building on CUDA 11.2 and it works.
You do not need to run the Makefile unless you modify the code.

If you still have questions, please provide the logs.

@benesse1899
Author

Sorry, I've checked the things you mentioned above, but the result is still nothing.

  1. cuda works
    nvcc:
root@k8s-node1:/home/ms0244456# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85 

nvidia-smi:

root@k8s-node1:/home/ms0244456# nvidia-smi
Wed Apr 28 04:43:38 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 950     Off  | 00000000:01:00.0  On |                  N/A |
| 38%   30C    P8     7W /  90W |    131MiB /  1993MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1794      G   /usr/lib/xorg/Xorg                100MiB |
|    0   N/A  N/A      2216      G   /usr/bin/gnome-shell               25MiB |
+-----------------------------------------------------------------------------+
  2. docker configured with nvidia as the default runtime
    Both of those are confirmed through docker info:
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: nvidia
  3. nvidia device plugin
    I use kubectl describe to check whether my k8s cluster has picked up the GPU from the worker node:
Capacity:
  cpu:                12
  ephemeral-storage:  239314556Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8027428Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                12
  ephemeral-storage:  220552294445
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             7925028Ki
  nvidia.com/gpu:     1
  pods:               110
  4. KubeShare's pods
root@k8s-master:/home/k8s-master/demo# kubectl get pods -n kube-system
NAME                                   READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-6cgct                1/1     Running   16         196d
coredns-f9fd979d6-qsbr8                1/1     Running   16         196d
etcd-k8s-master                        1/1     Running   12         196d
kube-apiserver-k8s-master              1/1     Running   8          160d
kube-controller-manager-k8s-master     1/1     Running   10         160d
kube-flannel-ds-lfn7l                  1/1     Running   10         7d
kube-flannel-ds-t7gch                  1/1     Running   0          56d
kube-flannel-ds-vhqmx                  1/1     Running   14         196d
kube-flannel-ds-wt9q9                  1/1     Running   21         196d
kube-proxy-6mpl5                       1/1     Running   0          56d
kube-proxy-czvgt                       1/1     Running   10         7d
kube-proxy-n8sb7                       1/1     Running   12         196d
kube-proxy-ncqm2                       1/1     Running   21         196d
kube-scheduler-k8s-master              1/1     Running   15         193d
kubeshare-device-manager               1/1     Running   0          165m
kubeshare-node-daemon-bbjbm            2/2     Running   0          165m
kubeshare-node-daemon-mtdx2            2/2     Running   1          165m
kubeshare-node-daemon-x8r4t            2/2     Running   0          165m
kubeshare-node-daemon-z4l7d            2/2     Running   0          165m
kubeshare-scheduler                    1/1     Running   0          165m
nvidia-device-plugin-daemonset-pw8hh   1/1     Running   1          3d11h
nvidia-device-plugin-daemonset-wgv44   1/1     Running   0          3d11h
nvidia-device-plugin-daemonset-wtp2b   1/1     Running   0          3d11h

My sharepod1 & 2

apiVersion: kubeshare.nthu/v1
kind: SharePod
metadata:
  name: pod1
  annotations:
    "kubeshare/gpu_request": "0.4"
    "kubeshare/gpu_limit": "1.0"
spec:
  terminationGracePeriodSeconds: 0
  containers:
  - name: tf
    image: tensorflow/tensorflow:1.15.2-gpu-py3
    command: ["sh", "-c", "curl -s https://lsalab.cs.nthu.edu.tw/~ericyeh/KubeShare/demo/mnist.py | python3 -"]
  restartPolicy: OnFailure
-----------------------------------------------------------------------------------------------
apiVersion: kubeshare.nthu/v1
kind: SharePod
metadata:
  name: pod2
  annotations:
    "kubeshare/gpu_request": "0.6"
    "kubeshare/gpu_limit": "1.0"
spec:
  terminationGracePeriodSeconds: 0
  containers:
  - name: tf
    image: tensorflow/tensorflow:1.15.2-gpu-py3
    command: ["sh", "-c", "curl -s https://lsalab.cs.nthu.edu.tw/~ericyeh/KubeShare/demo/mnist.py | python3 -"]
  restartPolicy: OnFailure
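
(For reference, the manifests above were presumably applied with something like the following; the filenames here are hypothetical:)

kubectl create -f sharepod1.yaml
kubectl create -f sharepod2.yaml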

Test command:

root@k8s-master:/home/k8s-master/demo# kubectl get sharepod
NAME   AGE
pod1   36m
pod2   36m
root@k8s-master:/home/k8s-master/demo# kubectl get pod
NAME                       READY   STATUS    RESTARTS   AGE
gpu-test-7d87449d9-dvsvj   1/1     Running   0          41d
root@k8s-master:/home/k8s-master/demo# kubectl get sharepod pod1 -o yaml | egrep -m2 'GPUID|nodeName'
root@k8s-master:/home/k8s-master/demo# 
root@k8s-master:/home/k8s-master/demo# kubectl get sharepod pod2 -o yaml | egrep -m2 'GPUID|nodeName'
root@k8s-master:/home/k8s-master/demo# 

My docker version: 20.10.6

And I want to ask whether I can add a nodeSelector in sharepod.yaml or not?

@StarCoral
Contributor

Hi @benesse1899,

  1. docker configured with nvidia as the default runtime
    Actually, I'm not sure whether an environment with several runtimes is detected correctly.
    You can check that the nvidia device plugin is working by submitting a GPU pod:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
    - name: digits-container
      image: nvcr.io/nvidia/digits:20.12-tensorflow-py3
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs

Also, here are our settings for reference:

cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
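
After changing /etc/docker/daemon.json, restart Docker so that the default runtime takes effect (a minimal sketch, assuming Docker is managed by systemd):

sudo systemctl restart docker
docker info | grep -i "default runtime"   # should print: Default Runtime: nvidia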

When applying the SharePod, please also provide some logs from the KubeShare pods (see the note on pod names after this list):

kubectl logs -n kube-system kubeshare-node-daemon
kubectl logs -n kube-system kubeshare-device-manager
kubectl logs -n kube-system kubeshare-scheduler
  2. The current version does not support nodeSelector.
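
Note: as the pod listing above shows, the node-daemon pods have DaemonSet name suffixes and run two containers, so in practice the commands look more like the following (a sketch; your pod suffixes will differ):

kubectl logs -n kube-system kubeshare-node-daemon-bbjbm --all-containers
kubectl logs -n kube-system kubeshare-device-manager
kubectl logs -n kube-system kubeshare-scheduler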

@benesse1899
Author

Hi @StarCoral,

The /etc/docker/daemon.json settings are already in place; I just forgot to show them in my previous reply, sorry about that.

And because I only have one GPU on each node (I have two nodes, and my master node doesn't have a GPU), I use this pod to check:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.0-devel
      resources:
        limits:
          nvidia.com/gpu: 1
      command: ["/bin/sh","-c","--"]
      args: ["while true; do sleep 30; done;"]
  nodeSelector:
    accelerator: nvidia-gtx-950

Then I go into the pod and run nvidia-smi:
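
(Presumably via something like the following; a sketch assuming a bash shell in the image:)

kubectl exec -it gpu-pod -- /bin/bash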

root@gpu-pod:/# nvidia-smi
Mon May  3 09:37:41 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 950     Off  | 00000000:01:00.0  On |                  N/A |
| 38%   30C    P8     6W /  90W |    127MiB /  1993MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

And the log of kubeshare-device-manager

I0503 09:59:58.803297       1 reflector.go:418] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Watch close - *v1.Pod total 15 items received
I0503 10:00:00.263311       1 config.go:188] Receive heartbeat from node: k8s-node2
I0503 10:00:08.498214       1 config.go:188] Receive heartbeat from node: k8s-node1
I0503 10:00:12.644666       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
E0503 10:00:12.644818       1 controller.go:259] SharePod 'default/pod1' must be scheduled! Spec.NodeName is empty.
I0503 10:00:12.644847       1 controller.go:228] Successfully synced 'default/pod1'
E0503 10:00:12.644872       1 controller.go:259] SharePod 'default/pod2' must be scheduled! Spec.NodeName is empty.
I0503 10:00:12.646050       1 controller.go:228] Successfully synced 'default/pod2'
I0503 10:00:13.737963       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 10:00:15.263033       1 config.go:188] Receive heartbeat from node: k8s-node2
I0503 10:00:23.497876       1 config.go:188] Receive heartbeat from node: k8s-node1
I0503 10:00:30.262750       1 config.go:188] Receive heartbeat from node: k8s-node2
I0503 10:00:38.497394       1 config.go:188] Receive heartbeat from node: k8s-node1
I0503 10:00:42.645114       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
E0503 10:00:42.645331       1 controller.go:259] SharePod 'default/pod1' must be scheduled! Spec.NodeName is empty.
I0503 10:00:42.645370       1 controller.go:228] Successfully synced 'default/pod1'
E0503 10:00:42.645397       1 controller.go:259] SharePod 'default/pod2' must be scheduled! Spec.NodeName is empty.
I0503 10:00:42.646571       1 controller.go:228] Successfully synced 'default/pod2'
I0503 10:00:43.738445       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 10:00:45.262509       1 config.go:188] Receive heartbeat from node: k8s-node2

kubeshare-scheduler

I0503 09:58:14.250398       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 09:58:40.197416       1 reflector.go:418] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Watch close - *v1.SharePod total 11 items received
I0503 09:58:42.735168       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 09:58:42.735273       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 09:58:44.250880       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 09:59:12.735621       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 09:59:12.735625       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 09:59:14.251096       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 09:59:42.735820       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 09:59:42.735874       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 09:59:44.251223       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 10:00:12.736060       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 10:00:12.736065       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 10:00:14.251711       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 10:00:42.736332       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 10:00:42.736334       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 10:00:44.251962       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 10:01:12.736534       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 10:01:12.736580       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 10:01:14.252086       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
I0503 10:01:23.262490       1 reflector.go:418] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Watch close - *v1.Pod total 0 items received
I0503 10:01:42.736805       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync

kubeshare-node-daemon of node1

I0503 09:59:38.515317       1 config-client.go:133] Send heartbeat: 2021-05-03 09:59:38.515280645 +0000 UTC m=+488971.102133620
I0503 09:59:53.515372       1 config-client.go:133] Send heartbeat: 2021-05-03 09:59:53.515325529 +0000 UTC m=+488986.102178499
I0503 10:00:08.515304       1 config-client.go:133] Send heartbeat: 2021-05-03 10:00:08.515271682 +0000 UTC m=+489001.102124690
I0503 10:00:23.515352       1 config-client.go:133] Send heartbeat: 2021-05-03 10:00:23.515320285 +0000 UTC m=+489016.102173259
I0503 10:00:38.515305       1 config-client.go:133] Send heartbeat: 2021-05-03 10:00:38.515273352 +0000 UTC m=+489031.102126244
I0503 10:00:53.515380       1 config-client.go:133] Send heartbeat: 2021-05-03 10:00:53.515325103 +0000 UTC m=+489046.102178074
I0503 10:01:08.515348       1 config-client.go:133] Send heartbeat: 2021-05-03 10:01:08.515316391 +0000 UTC m=+489061.102169363
I0503 10:01:23.515348       1 config-client.go:133] Send heartbeat: 2021-05-03 10:01:23.51531602 +0000 UTC m=+489076.102168993
I0503 10:01:38.515300       1 config-client.go:133] Send heartbeat: 2021-05-03 10:01:38.515266902 +0000 UTC m=+489091.102119855

@StarCoral
Contributor

Hi @benesse1899,

The annotation kubeshare/gpu_mem needs to be set in the SharePod manifest.
I think this is the reason why the Pod cannot be created.
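
For example, added alongside the existing annotations (a sketch of pod1; the 500000000 value is the one used later in this thread):

metadata:
  name: pod1
  annotations:
    "kubeshare/gpu_request": "0.4"
    "kubeshare/gpu_limit": "1.0"
    "kubeshare/gpu_mem": "500000000"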

@benesse1899
Author

benesse1899 commented May 4, 2021

Hi @StarCoral,
Well, I've tried adding kubeshare/gpu_mem back, but it's still not working.
The value I set is 500000000.

The system can still get the SharePods:

root@k8s-master:/home/k8s-master/demo# kubectl get sharepod
NAME   AGE
pod1   3h11m
pod2   3h11m

and the YAML of pod1:

apiVersion: kubeshare.nthu/v1
kind: SharePod
metadata:
  annotations:
    kubeshare/gpu_limit: '1.0'
    kubeshare/gpu_mem: '500000000'
    kubeshare/gpu_request: '0.4'
  creationTimestamp: '2021-05-05T02:29:29Z'
  generation: 1
  managedFields:
    - apiVersion: kubeshare.nthu/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            .: {}
            'f:kubeshare/gpu_limit': {}
            'f:kubeshare/gpu_mem': {}
            'f:kubeshare/gpu_request': {}
        'f:spec':
          .: {}
          'f:containers': {}
          'f:restartPolicy': {}
          'f:terminationGracePeriodSeconds': {}
      manager: kubectl-create
      operation: Update
      time: '2021-05-05T02:29:29Z'
  name: pod1
  namespace: default
  resourceVersion: '41039320'
  selfLink: /apis/kubeshare.nthu/v1/namespaces/default/sharepods/pod1
  uid: b2c42eaf-5424-44a3-a04b-25da8e9175eb
spec:
  containers:
    - command:
        - sh
        - '-c'
        - >-
          curl -s https://lsalab.cs.nthu.edu.tw/~ericyeh/KubeShare/demo/mnist.py
          | python3 -
      image: 'tensorflow/tensorflow:1.15.2-gpu-py3'
      name: tf
  restartPolicy: OnFailure
  terminationGracePeriodSeconds: 0

Should I show the info of my three nodes?

@StarCoral
Contributor

According to the log of kubeshare-device-manager, the kubeshare-scheduler didn't find a node to assign the pod to.

I tried the YAML file you provided and it works normally in our environment.
I may need more error messages to be able to help you.

@y-ykcir

y-ykcir commented Oct 4, 2022

Hi @StarCoral,

I ran into almost the same problem when I tried to deploy a SharePod with the YAML files in the doc/yaml/ folder.

The log of kubeshare-device-manager showed that it didn't find a node to assign the pod to:

I1004 05:27:55.545279       1 reflector.go:268] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: forcing resync
E1004 05:27:55.545515       1 controller.go:259] SharePod 'default/pod1' must be scheduled! Spec.NodeName is empty.
I1004 05:27:55.545549       1 controller.go:228] Successfully synced 'default/pod1'

However, I found error messages in the kubelet log after I deployed KubeShare:

Oct 04 13:21:51 master-OptiPlex-7050 kubelet[4346]: 2022-10-04 13:21:51.826 [INFO][17519] k8s.go 417: Wrote updated endpoint to datastore ContainerID="cfedeb9ce5aa8dc8e7e7633ac099dd24cb8dd045f289e9ce37844bbcb680f406" Namespace="kube-system" Pod="kubeshare-scheduler" WorkloadEndpoint="master--optiplex--7050-k8s-kubeshare--scheduler-eth0"
Oct 04 13:21:54 master-OptiPlex-7050 kubelet[4346]: E1004 13:21:54.098101    4346 remote_runtime.go:295] ContainerStatus "68b822e852c01812ead249fae6d0dd4efb601866441f3274587d710f456d5999" from runtime service failed: rpc error: code = Unknown desc = Error: No such container: 68b822e852c01812ead249fae6d0dd4efb601866441f3274587d710f456d5999
Oct 04 13:21:54 master-OptiPlex-7050 kubelet[4346]: E1004 13:21:54.098118    4346 kuberuntime_manager.go:952] getPodContainerStatuses for pod "kubeshare-scheduler_kube-system(6c4e994a-6cc6-4e59-abb7-2bbdc255a3e9)" failed: rpc error: code = Unknown desc = Error: No such container: 68b822e852c01812ead249fae6d0dd4efb601866441f3274587d710f456d5999
Oct 04 13:21:55 master-OptiPlex-7050 kubelet[4346]: W1004 13:21:55.654111    4346 container.go:412] Failed to create summary reader for "/kubepods/besteffort/pod94d51d32-dfca-461b-a15c-f3dde9aae779/9d98c32a777265b61be37b8b310c7041229fef235bd3bec1f087672800b8e8d3": none of the resources are being tracked.

But kubectl get pod -n kube-system looks good

coredns-7ff77c879f-htmhw                       1/1     Running   9          12d
coredns-7ff77c879f-qf5kt                       1/1     Running   9          12d
etcd-master-optiplex-7050                      1/1     Running   9          12d
kube-apiserver-master-optiplex-7050            1/1     Running   9          12d
kube-controller-manager-master-optiplex-7050   1/1     Running   9          12d
kube-proxy-547ps                               1/1     Running   0          6d1h
kube-proxy-54gbr                               1/1     Running   9          12d
kube-proxy-j784s                               1/1     Running   0          3h7m
kube-scheduler-master-optiplex-7050            1/1     Running   9          12d
kubeshare-device-manager                       1/1     Running   0          46m
kubeshare-node-daemon-6f7c7                    2/2     Running   1          46m
kubeshare-node-daemon-qrrlc                    2/2     Running   0          46m
kubeshare-node-daemon-rq272                    2/2     Running   0          46m
kubeshare-scheduler                            1/1     Running   0          46m

Maybe this is the reason, but I can't solve it.
I hope you are still watching this issue; I can provide more information if needed.

@justin0u0
Contributor

Hi @y-ykcir, the error SharePod 'default/pod1' must be scheduled! Spec.NodeName is empty. may be a temporary error that gets requeued and handled again, which is why you can see Successfully synced 'default/pod1' after the error.

Can you see the pod running normally in the default namespace?
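
(For a quick check, something like the following should show whether the backing pod was created and where it was scheduled, reusing commands from earlier in the thread:)

kubectl get pods -n default -o wide
kubectl get sharepod pod1 -o yaml | egrep -m2 'GPUID|nodeName'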

@y-ykcir

y-ykcir commented Oct 10, 2022

Hi @justin0u0, I can't see the pod running normally in the default namespace. However, it seemed to work after I re-installed Kubernetes and KubeShare.
But I'm still stuck on issue #19, could you help me with that one?
