some problems while testing resource isolation #19

Open
y-ykcir opened this issue Oct 10, 2022 · 0 comments

y-ykcir commented Oct 10, 2022

Hi, I ran into some problems while testing resource isolation. KubeShare itself seems to be running normally, but the isolation specified by the annotations does not have the expected effect.

My Environment

  • GPU: NVIDIA GeForce RTX 3090 Ti (24GiB)
  • CPU: Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz
  • docker version: 20.10.12
  • nvidia-docker2 version: 2.11.0 (default runtime)
  • NVIDIA driver version: 510.73, host CUDA version: 11.6
  • Kubernetes client: v1.20.0, server: v1.20.15 (single node with GPU)

Resource isolation test

SharePod file:

apiVersion: kubeshare.nthu/v1
kind: SharePod
metadata:
  name: sharepod1
  annotations:
    "kubeshare/gpu_request": "0.5"
    "kubeshare/gpu_limit": "0.6"
    "kubeshare/gpu_mem": "10485760000"
spec:
  terminationGracePeriodSeconds: 0
  containers:
  - name: tensorflow-benchmark
    image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.4.0
    command:
    - bash
    - run.sh
    - --num_batches=50000
    - --batch_size=8
    workingDir: /root
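
For reference, the kubeshare/gpu_mem value of 10485760000 bytes works out to 10000 MiB (about 9.77 GiB). To double-check that the SharePod was accepted as written and that KubeShare injected anything into the resulting pod, I can run checks along these lines (a sketch; the grep patterns are only my guesses at what an injected hook library or environment variable might be called):

# 10485760000 bytes / 1048576 = 10000 MiB, i.e. a cap just under 10 GiB
kubectl get sharepod sharepod1 -o yaml                                      # confirm the annotations on the SharePod object
kubectl get pod sharepod1 -o yaml | grep -iE 'kubeshare|gemini|ld_preload'  # look for an injected hook library or env var (pattern names are guesses)
kubectl exec sharepod1 -- env | grep -iE 'gpu|kubeshare'                    # environment actually visible to the workload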

kubectl get pod -A

NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
default       sharepod1                                  1/1     Running   0          3m39s
kube-system   calico-kube-controllers-7854b85cf7-sd5fw   1/1     Running   0          2d10h
kube-system   calico-node-ccdcp                          1/1     Running   0          2d10h
kube-system   coredns-54d67798b7-f5fv8                   1/1     Running   0          2d10h
kube-system   coredns-54d67798b7-rlvhg                   1/1     Running   0          2d10h
kube-system   etcd-k8s-master                            1/1     Running   0          2d10h
kube-system   kube-apiserver-k8s-master                  1/1     Running   0          2d10h
kube-system   kube-controller-manager-k8s-master         1/1     Running   0          2d10h
kube-system   kube-proxy-lz6jn                           1/1     Running   0          2d10h
kube-system   kube-scheduler-k8s-master                  1/1     Running   0          2d10h
kube-system   kubeshare-device-manager                   1/1     Running   0          2d10h
kube-system   kubeshare-node-daemon-f58tc                2/2     Running   0          2d10h
kube-system   kubeshare-scheduler                        1/1     Running   0          2d10h
kube-system   kubeshare-vgpu-k8s-master-gzwvx            1/1     Running   0          3m40s
kube-system   nvidia-device-plugin-daemonset-twghw       1/1     Running   0          2d10h

kubectl logs sharepod1 shows the benchmark appears to be working:

INFO:tensorflow:Running local_init_op.
I1010 01:02:41.913408 140019771205440 session_manager.py:505] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1010 01:02:41.943301 140019771205440 session_manager.py:508] Done running local_init_op.
2022-10-10 01:02:42.579839: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-10-10 01:04:04.418663: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-10-10 01:17:41.610889: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'

Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
TensorFlow:  2.2
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  8 global
             8 per device
Num batches: 50000
Num epochs:  0.31
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Time    Step    Img/sec total_loss
2022-10-10 01:18        1       images/sec: 354.7 +/- 0.0 (jitter = 0.0)        nan
2022-10-10 01:18        10      images/sec: 355.0 +/- 0.4 (jitter = 0.8)        nan
2022-10-10 01:18        20      images/sec: 354.8 +/- 0.3 (jitter = 1.3)        nan
2022-10-10 01:18        30      images/sec: 354.7 +/- 0.2 (jitter = 1.2)        nan
2022-10-10 01:18        40      images/sec: 354.7 +/- 0.2 (jitter = 1.2)        nan
2022-10-10 01:18        50      images/sec: 66.9 +/- 7.0 (jitter = 1.4) nan
2022-10-10 01:18        60      images/sec: 77.3 +/- 5.8 (jitter = 1.4) nan
2022-10-10 01:18        70      images/sec: 87.0 +/- 5.0 (jitter = 1.4) nan
2022-10-10 01:18        80      images/sec: 96.0 +/- 4.4 (jitter = 1.3) nan
2022-10-10 01:18        90      images/sec: 104.5 +/- 3.9 (jitter = 1.4)        nan
2022-10-10 01:18        100     images/sec: 112.4 +/- 3.5 (jitter = 1.4)        nan
2022-10-10 01:18        110     images/sec: 119.8 +/- 3.2 (jitter = 1.5)        nan
2022-10-10 01:18        120     images/sec: 126.8 +/- 2.9 (jitter = 1.3)        nan
2022-10-10 01:18        130     images/sec: 133.4 +/- 2.7 (jitter = 1.4)        nan
2022-10-10 01:19        140     images/sec: 139.6 +/- 2.5 (jitter = 1.4)        nan
2022-10-10 01:19        150     images/sec: 145.5 +/- 2.4 (jitter = 1.5)        nan
2022-10-10 01:19        160     images/sec: 151.0 +/- 2.2 (jitter = 1.5)        nan
2022-10-10 01:19        170     images/sec: 156.3 +/- 2.1 (jitter = 1.5)        nan

However, the resource isolation annotations do not seem to take effect.

nvidia-smi on the host:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
| 52%   81C    P2   328W / 450W |   8409MiB / 24564MiB |     96%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     83363      C   python                           8407MiB |
+-----------------------------------------------------------------------------+

nvidia-smi inside sharepod1:

root@sharepod1:~# nvidia-smi
Mon Oct 10 01:43:31 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
| 88%   83C    P2   332W / 450W |   8409MiB / 24564MiB |     95%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
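
If it is useful, I can also try an explicit over-allocation test from inside the pod, something like the sketch below (the inline Python is my own check, not part of KubeShare). If the 10000 MiB kubeshare/gpu_mem cap were enforced at the CUDA level, I would expect an allocation of roughly 12 GiB to fail:

kubectl exec -it sharepod1 -- python -c "
import tensorflow as tf
with tf.device('/GPU:0'):
    # ~12 GiB of float32 (3 * 1024**3 elements * 4 bytes each); this should raise an
    # out-of-memory error if the kubeshare/gpu_mem cap is actually enforced
    x = tf.zeros([3 * 1024 * 1024 * 1024], dtype=tf.float32)
    print('allocated tensor of shape', x.shape)
"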

I would like to ask what the problem might be.
