shaibi/RUN 16744 support kwok #66
Conversation
gshaibi
commented
Mar 18, 2024
- KWOK support Design
- samples
- RunAI's Node Exporter
  - The current deployment as a DaemonSet is incompatible with fake nodes.
- Device Plugin
  - The current deployment as a DaemonSet is incompatible with fake nodes. We might want to not support it on fake nodes and require a manual node resources update.
Why is this different from the others? What kind of manual change do you mean here?
Since the Device Plugin is easier to replace manually.
Replacing it manually means editing the capacity and allocatable sections of the nodes.
Replacing the status exporter would require exporting metrics and labels manually, which is harder.
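For illustration, a minimal sketch of what such a manual node resources update could look like with client-go; the node name and GPU count are placeholders, and the fake node is assumed to already exist (e.g. created by KWOK):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Hypothetical fake node created by KWOK; the name is a placeholder.
	nodeName := "fake-gpu-node-0"

	// Patch the node status so the scheduler sees fake GPU resources.
	patch := []byte(`{"status":{"capacity":{"nvidia.com/gpu":"8"},"allocatable":{"nvidia.com/gpu":"8"}}}`)
	_, err = clientset.CoreV1().Nodes().Patch(
		context.TODO(),
		nodeName,
		types.StrategicMergePatchType,
		patch,
		metav1.PatchOptions{},
		"status", // capacity and allocatable live in the status subresource
	)
	if err != nil {
		panic(err)
	}
	fmt.Printf("patched %s with fake GPU capacity\n", nodeName)
}
```

KWOK-managed nodes generally keep status fields patched this way, so the scheduler then sees the fake GPU capacity.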
design/KWOK.md
Outdated
- The current deployment as a DaemonSet is incompatible with fake nodes. We might want to not support it on fake nodes and require a manual node resources update.

## Design
- [ ] Implement a single monolithic service named `status-exporter` to handle all exportation logic when GPU nodes are fake. This service will be disabled be default, and will be manually enabled when running on kwok cluster. This service will encompass the following:
be -> by
Also, `fake-status-exporter` / `fake-node-status-exporter`?
All the services in the Fake GPU Operator are fake, so I didn't append a `fake` prefix to each of their names.
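As a side note on the "disabled by default, manually enabled on a kwok cluster" part of the design: a purely illustrative sketch of such an opt-in switch; the environment variable name and the `runStatusExporter` placeholder are assumptions, not existing settings:

```go
package main

import (
	"log"
	"os"
)

// runStatusExporter is a placeholder for the monolithic exportation logic
// described in the design; its contents are not part of this sketch.
func runStatusExporter() error {
	// ... export metrics, labels, and node status for fake GPU nodes ...
	return nil
}

func main() {
	// Hypothetical opt-in switch: the service stays disabled by default and is
	// only started when explicitly enabled for a KWOK cluster. The variable
	// name FAKE_GPU_OPERATOR_KWOK is an assumption, not an existing setting.
	if os.Getenv("FAKE_GPU_OPERATOR_KWOK") != "true" {
		log.Println("status-exporter disabled; set FAKE_GPU_OPERATOR_KWOK=true to enable it on a KWOK cluster")
		return
	}

	if err := runStatusExporter(); err != nil {
		log.Fatalf("status-exporter failed: %v", err)
	}
}
```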
design/KWOK.md
Outdated
- [ ] Metrics
  - [ ] Export the same as today, with the following label enrichments (`<pod>` refers to the dcgm-exporter fake pod):
    - [ ] `container="nvidia-dcgm-exporter"`
    - [ ] `instance="<pod-ip>:9400"`
    - [ ] `job="nvidia-dcgm-exporter"`
    - [ ] `pod="<pod-name>"`
    - [ ] `service="nvidia-dcgm-exporter"`
- [ ] FileSystem
  - [ ] Directly export Node Exporter's metrics instead of exporting to the FileSystem, including:
    - [ ] `runai_pod_gpu_utilization` with labels `pod_uuid` and `gpu`
    - [ ] `runai_pod_gpu_memory_used_bytes` with labels `pod_uuid` and `gpu`
- [ ] Labels
  - [ ] Ensure consistent label exportation.
- [ ] Add a ServiceMonitor for the new service, and set `honorLabels: true` on it (so we can fake multiple exporters).
This section was too technical for me to read and digest without an in-person explanation.
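For concreteness, a minimal sketch of the direct metrics exportation described in the list above, assuming `prometheus/client_golang`; the metric and label names come from the design, while the concrete values and the port binding are placeholders:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Gauge carrying the per-pod GPU utilization, as in the design above.
	gpuUtilization := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "runai_pod_gpu_utilization",
		Help: "Fake GPU utilization reported per pod and GPU.",
	}, []string{
		// Labels listed in the design:
		"pod_uuid", "gpu",
		// Enrichment labels normally added by Prometheus relabeling; exported
		// directly here so honorLabels: true on the ServiceMonitor keeps them.
		"container", "instance", "job", "pod", "service",
	})

	registry := prometheus.NewRegistry()
	registry.MustRegister(gpuUtilization)

	// Placeholder values for a single fake dcgm-exporter pod.
	gpuUtilization.WithLabelValues(
		"123e4567-e89b-12d3-a456-426614174000", // pod_uuid (placeholder)
		"0",                     // gpu index (placeholder)
		"nvidia-dcgm-exporter",  // container
		"10.0.0.12:9400",        // instance, i.e. <pod-ip>:9400 (placeholder)
		"nvidia-dcgm-exporter",  // job
		"dcgm-exporter-fake-0",  // pod, i.e. <pod-name> (placeholder)
		"nvidia-dcgm-exporter",  // service
	).Set(42.0)

	// Serve the metrics directly instead of writing them to the filesystem.
	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":9400", nil))
}
```

Because the `pod`, `instance`, `job`, etc. labels are set by the exporter itself, the ServiceMonitor needs `honorLabels: true` so Prometheus does not overwrite them with the real target's labels.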
- [ ] Ensure consistent label exportation.
- [ ] Add a ServiceMonitor for the new service, and set `honorLabels: true` on it (so we can fake multiple exporters).

## Limitations
Only on fake nodes, right? Since we do support MIG and nvidia-smi in the fake GPU operator overall, AFAIK.
Yes, of course.