dcgm-exporter DaemonSet startup error: failed to pass the health check #393

Open
guoliangmiao opened this issue Sep 26, 2024 · 2 comments
Labels
question Further information is requested

Comments

@guoliangmiao

Ask your question

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: dcgm-exporter
      app.kubernetes.io/instance: dcgm-exporter
      app.kubernetes.io/name: dcgm-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/component: dcgm-exporter
        app.kubernetes.io/instance: dcgm-exporter
        app.kubernetes.io/name: dcgm-exporter
      namespace: monitoring
    spec:
      containers:
        - args:
            - '-f'
            - /etc/dcgm-exporter/dcp-metrics-included.csv
          env:
            - name: DCGM_EXPORTER_KUBERNETES
              value: 'true'
            - name: DCGM_EXPORTER_LISTEN
              value: ':9400'
            - name: DCGM_EXPORTER_DEBUG
              value: 'true'
          image: >-
            crpi-y1mch6v3dn8zzsm8.cn-hangzhou.personal.cr.aliyuncs.com/cloudsway/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /health
              port: 9400
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          name: exporter
          ports:
            - containerPort: 9400
              name: metrics
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /health
              port: 9400
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            requests:
              cpu: '1'
              memory: 2Gi
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
            runAsNonRoot: false
            runAsUser: 0
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /var/lib/kubelet/pod-resources
              name: pod-gpu-resources
              readOnly: true

=================================================================================================
2024/09/26 04:12:34 maxprocs: Leaving GOMAXPROCS=192: CPU quota undefined
2024-09-26T12:12:34.555993494+08:00 time="2024-09-26T04:12:34Z" level=info msg="Starting dcgm-exporter"
2024-09-26T12:12:34.556005237+08:00 time="2024-09-26T04:12:34Z" level=debug msg="Debug output is enabled"
2024-09-26T12:12:34.556632224+08:00 time="2024-09-26T04:12:34Z" level=debug msg="Command line: /usr/bin/dcgm-exporter -f /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2024-09-26T04:12:34Z" level=debug msg="Loaded configuration" dump="&{CollectorsFile:/etc/dcgm-exporter/dcp-metrics-included.csv Address::9400 CollectInterval:30000 Kubernetes:true KubernetesGPUIdType:uid CollectDCP:true UseOldNamespace:false UseRemoteHE:false RemoteHEInfo:localhost:5555 GPUDevices:{Flex:true MajorRange:[] MinorRange:[]} SwitchDevices:{Flex:true MajorRange:[] MinorRange:[]} CPUDevices:{Flex:true MajorRange:[] MinorRange:[]} NoHostname:false UseFakeGPUs:false ConfigMapData:none MetricGroups:[] WebSystemdSocket:false WebConfigFile: XIDCountWindowSize:300000 ReplaceBlanksInModelName:false Debug:true ClockEventsCountWindowSize:300000 EnableDCGMLog:false DCGMLogLevel:NONE PodResourcesKubeletSocket:/var/lib/kubelet/pod-resources/kubelet.sock HPCJobMappingDir: NvidiaResourceNames:[]}"
time="2024-09-26T04:12:34Z" level=info msg="DCGM successfully initialized!"
2024-09-26T12:12:34.948180836+08:00 time="2024-09-26T04:12:34Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
2024-09-26T12:12:34.948188060+08:00 time="2024-09-26T04:12:34Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
2024-09-26T12:12:34.948330476+08:00 time="2024-09-26T04:12:34Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
2024-09-26T12:12:34.948341428+08:00 time="2024-09-26T04:12:34Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
2024-09-26T12:12:34.948345155+08:00 time="2024-09-26T04:12:34Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
2024-09-26T12:12:34.948359042+08:00 time="2024-09-26T04:12:34Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-09-26T04:12:34Z" level=warning msg="Skipping line 24 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
2024-09-26T12:12:34.948489134+08:00 time="2024-09-26T04:12:34Z" level=info msg="Initializing system entities of type: GPU"
time="2024-09-26T04:12:35Z" level=debug msg="System entities of type GPU initialized"
2024-09-26T12:12:35.285853835+08:00 time="2024-09-26T04:12:35Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-09-26T04:12:35Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
2024-09-26T12:12:35.285870137+08:00 time="2024-09-26T04:12:35Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-09-26T04:12:35Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-09-26T04:12:35Z" level=debug msg="Counters are initialized" dump="[{FieldID:100 FieldName:DCGM_FI_DEV_SM_CLOCK PromType:gauge Help:SM clock frequency (in MHz).} {FieldID:101 FieldName:DCGM_FI_DEV_MEM_CLOCK PromType:gauge Help:Memory clock frequency (in MHz).} {FieldID:140 FieldName:DCGM_FI_DEV_MEMORY_TEMP PromType:gauge Help:Memory temperature (in C).} {FieldID:150 FieldName:DCGM_FI_DEV_GPU_TEMP PromType:gauge Help:GPU temperature (in C).} {FieldID:155 FieldName:DCGM_FI_DEV_POWER_USAGE PromType:gauge Help:Power draw (in W).} {FieldID:156 FieldName:DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION PromType:counter Help:Total energy consumption since boot (in mJ).} {FieldID:202 FieldName:DCGM_FI_DEV_PCIE_REPLAY_COUNTER PromType:counter Help:Total number of PCIe retries.} {FieldID:203 FieldName:DCGM_FI_DEV_GPU_UTIL PromType:gauge Help:GPU utilization (in %).} {FieldID:204 FieldName:DCGM_FI_DEV_MEM_COPY_UTIL PromType:gauge Help:Memory utilization (in %).} {FieldID:206 FieldName:DCGM_FI_DEV_ENC_UTIL PromType:gauge Help:Encoder utilization (in %).} {FieldID:207 FieldName:DCGM_FI_DEV_DEC_UTIL PromType:gauge Help:Decoder utilization (in %).} {FieldID:230 FieldName:DCGM_FI_DEV_XID_ERRORS PromType:gauge Help:Value of the last XID error encountered.} {FieldID:251 FieldName:DCGM_FI_DEV_FB_FREE PromType:gauge Help:Framebuffer memory free (in MiB).} {FieldID:252 FieldName:DCGM_FI_DEV_FB_USED PromType:gauge Help:Framebuffer memory used (in MiB).} {FieldID:449 FieldName:DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL PromType:counter Help:Total number of NVLink bandwidth counters for all lanes.} {FieldID:526 FieldName:DCGM_FI_DEV_VGPU_LICENSE_STATUS PromType:gauge Help:vGPU License status} {FieldID:393 FieldName:DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS PromType:counter Help:Number of remapped rows for uncorrectable errors} {FieldID:394 FieldName:DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS PromType:counter Help:Number of remapped rows for correctable errors} {FieldID:395 FieldName:DCGM_FI_DEV_ROW_REMAP_FAILURE PromType:gauge Help:Whether remapping of rows has failed} {FieldID:1 FieldName:DCGM_FI_DRIVER_VERSION PromType:label Help:Driver Version}]"
time="2024-09-26T04:12:35Z" level=info msg="Kubernetes metrics collection enabled!"
2024-09-26T12:12:35.321315631+08:00 time="2024-09-26T04:12:35Z" level=info msg="Pipeline starting"
2024-09-26T12:12:35.321320811+08:00 time="2024-09-26T04:12:35Z" level=info msg="Starting webserver"
2024-09-26T12:12:35.321682302+08:00 time="2024-09-26T04:12:35Z" level=info msg="Listening on" address="[::]:9400"
2024-09-26T12:12:35.321690668+08:00 time="2024-09-26T04:12:35Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false

=======================================================================================================
The sections above show the key parts of the DaemonSet manifest and the container logs, captured with debug mode enabled. I have not been able to identify the root cause of the health check failure. Please help me with this.

@guoliangmiao added the "question" (Further information is requested) label on Sep 26, 2024
@nvvfedorov
Collaborator

@guoliangmiao, according to the log, dcgm-exporter started.

I also see that performance metrics aren't supported on your hardware: "Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded".

@guoliangmiao
Author

I found that adding the runtime: nvidia field to the deployment file solves the problem. However, I don't quite understand why this is necessary, because default_container_runtime is already set to nvidia in the containerd configuration on every node.
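
For readers who land here with the same symptom, here is a minimal sketch of what that fix might look like. It assumes that "runtime: nvidia" refers to the Pod spec field `runtimeClassName`, and that a RuntimeClass named `nvidia` exists whose handler matches the runtime name registered in containerd; neither is confirmed in this thread. The RuntimeClass object and the trimmed label set are illustrative only, and the fields not shown are the same as in the manifest above.

```yaml
# Assumption: the handler name "nvidia" matches the runtime configured
# in containerd on each GPU node. Adjust it if your nodes use another name.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dcgm-exporter
    spec:
      # Assumed equivalent of the "runtime: nvidia" field mentioned above:
      # selects the RuntimeClass explicitly instead of relying on the
      # node's default container runtime.
      runtimeClassName: nvidia
      containers:
        - name: exporter
          image: >-
            crpi-y1mch6v3dn8zzsm8.cn-hangzhou.personal.cr.aliyuncs.com/cloudsway/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
          ports:
            - containerPort: 9400
              name: metrics
              protocol: TCP
```

As for why the node default did not take effect: in containerd's CRI plugin configuration the default runtime is usually set with `default_runtime_name`, so it may be worth checking that the key is spelled that way and that containerd was restarted after the change.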
