The initialization time is too long during mnist test #170

Open
Natelu opened this issue Nov 23, 2022 · 0 comments

Initialization in my MNIST training session takes too long: about 5 minutes pass between the "Creating TensorFlow device" log lines and the first training step.

2022-11-23 08:15:22.173334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2022-11-23 08:15:22.173363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2022-11-23 08:15:22.173375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y
2022-11-23 08:15:22.173384: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y
2022-11-23 08:15:22.173402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2022-11-23 08:15:22.173450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:44:00.0, compute capability: 7.0)
[------------- about 2 minutes elapse here -------------]
Initialized!
[------------- about 3 minutes elapse here -------------]
Step 0 (epoch 0.00), 2118.7 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.5%
duration between initialized and running is %d s 210.556521893
duration between initialized and running is %d s 210.559849024
duration between initialized and running is %d s 210.563081026
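The `%d` in the duration lines above is printed literally, which suggests the value was passed to `print` as a separate argument instead of being applied to the format string (an assumption, since the test script is not shown). A minimal sketch of how the two gaps could be timed and reported per phase follows; `PhaseTimer` and `format_duration` are hypothetical helper names, not part of the MNIST test or of TensorFlow.

```python
import time


class PhaseTimer:
    """Records wall-clock time elapsed between successive checkpoints.

    Hypothetical helper for illustration; not taken from the test script.
    """

    def __init__(self, clock=time.perf_counter):
        self._clock = clock          # injectable so the timer can be tested
        self._last = clock()         # timestamp of the previous checkpoint
        self.durations = {}          # phase name -> seconds

    def mark(self, phase):
        """Close the current phase and return its duration in seconds."""
        now = self._clock()
        elapsed = now - self._last
        self.durations[phase] = elapsed
        self._last = now
        return elapsed


def format_duration(label, seconds):
    """Apply the format string with %, so no placeholder is printed literally."""
    return "duration of %s is %.3f s" % (label, seconds)


if __name__ == "__main__":
    timer = PhaseTimer()
    time.sleep(0.01)                 # stand-in for session creation / variable init
    print(format_duration("session init", timer.mark("session init")))
    time.sleep(0.01)                 # stand-in for the first training step
    print(format_duration("first step", timer.mark("first step")))
```

In the real script, the first `mark` would go right after the "Initialized!" message and the second after step 0 completes, so each gap is reported separately.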

Base environment

Device: Tesla V100-PCIE-16GB; Driver Version: 470.141.03; CUDA Version: 11.4

System ENV

KUBE: v1.23.10
RUNC: 1.1.1
Containerd: v1.6.4
OS Kernel: Linux 3.10.0-1160.el7.x86_64
OS version: CentOS Linux 7 (Core)
CPU: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Pod Resource:

kind: Deployment
metadata:
  labels:
    k8s-app: vcuda-test
    qcloud-app: vcuda-test
  name: vcuda-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: vcuda-test
  template:
    metadata:
      labels:
        k8s-app: vcuda-test
        qcloud-app: vcuda-test
    spec:
      containers:
      - command:
        - sleep
        - 360000s
        env:
        - name: PATH
          value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
        image: <internal-repository>/tensorflow-gputest:0.2
        imagePullPolicy: IfNotPresent
        name: tensorflow-test
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "200"
            tencent.com/vcuda-memory: "30"
          requests:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "200"
            tencent.com/vcuda-memory: "30"