shaibi/RUN 16744 support kwok #66

Closed
wants to merge 7 commits into from

Contributor

@gshaibi gshaibi commented Mar 18, 2024

  • KWOK support Design
  • samples
  • .
  • .

- RunAI's Node Exporter
  - The current deployment as a DaemonSet is incompatible with fake nodes.
- Device Plugin
  - The current deployment as a DaemonSet is incompatible with fake nodes. We might want to not support it on fake nodes and require manual node resources update.
Contributor

Why is this different from the others? What kind of manual change do you mean here?

Contributor Author

Because the device plugin is easier to replace manually.
Replacing it manually means editing the `capacity` and `allocatable` sections of the nodes.
Replacing the status exporter would require exporting metrics and labels manually, which is harder.
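
To make the manual node update concrete, here is a minimal sketch of the patch body it implies. The helper name `gpu_capacity_patch` is hypothetical, and the resource name `nvidia.com/gpu` is assumed to be the one the device plugin would normally advertise. `capacity` and `allocatable` live under the node's `status`, so the patch would most likely have to be applied against the status subresource.

```python
import json


def gpu_capacity_patch(gpu_count: int, resource: str = "nvidia.com/gpu") -> dict:
    """Build a merge-patch body that sets a node's GPU capacity and allocatable.

    Hypothetical helper for illustration; it only builds the JSON body and
    does not talk to the cluster.
    """
    if gpu_count < 0:
        raise ValueError("gpu_count must be non-negative")
    # Kubernetes resource quantities are serialized as strings.
    quantity = str(gpu_count)
    return {
        "status": {
            "capacity": {resource: quantity},
            "allocatable": {resource: quantity},
        }
    }


if __name__ == "__main__":
    # Print the body for a fake node that should expose 8 GPUs.
    print(json.dumps(gpu_capacity_patch(8), indent=2))
```

The printed JSON could then be passed to whatever client is used to patch the fake node's status.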

design/KWOK.md Outdated
- The current deployment as a DaemonSet is incompatible with fake nodes. We might want to not support it on fake nodes and require manual node resources update.

## Design
- [ ] Implement a single monolithic service named `status-exporter` to handle all exportation logic when GPU nodes are fake. This service will be disabled be default, and will be manually enabled when running on kwok cluster. This service will encompass the following:
Contributor

be -> by
also fake-status-exporter / fake-node-status-exporter?

Contributor Author

@gshaibi gshaibi Mar 19, 2024

All the services in the Fake GPU Operator are fakes, so I didn't add a "fake" prefix to each of their names.

design/KWOK.md Outdated
Comment on lines 26 to 39
- [ ] Metrics
  - [ ] Export the same as today, with the following label enrichments (`<pod>` refers to the dcgm-exporter fake pod):
    - [ ] `container="nvidia-dcgm-exporter"`
    - [ ] `instance="<pod-ip>:9400"`
    - [ ] `job="nvidia-dcgm-exporter"`
    - [ ] `pod="<pod-name>"`
    - [ ] `service="nvidia-dcgm-exporter"`
- [ ] FileSystem
  - [ ] Directly export Node Exporter's metrics instead of exporting to the FileSystem, including:
    - [ ] `runai_pod_gpu_utilization` with labels `pod_uuid` and `gpu`
    - [ ] `runai_pod_gpu_memory_used_bytes` with labels `pod_uuid` and `gpu`
- [ ] Labels
  - [ ] Ensure consistent label exportation.
- [ ] Add a ServiceMonitor for the new service, and set `honorLabels: true` on it (so we can fake multiple exporters).
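
The label enrichment in the checklist above can be illustrated with a small sketch. It assumes the service renders samples in the Prometheus text exposition format and stamps each one with the labels of the fake dcgm-exporter pod. The function names, the pod name and IP, and the choice of `DCGM_FI_DEV_GPU_UTIL` as the sample metric are illustrative assumptions, not taken from the design.

```python
def escape_label_value(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def enrich_labels(labels: dict, pod_name: str, pod_ip: str, port: int = 9400) -> dict:
    """Return a copy of `labels` with the dcgm-exporter identity labels added."""
    exporter = "nvidia-dcgm-exporter"
    enriched = dict(labels)
    enriched.update(
        {
            "container": exporter,
            "instance": f"{pod_ip}:{port}",
            "job": exporter,
            "pod": pod_name,
            "service": exporter,
        }
    )
    return enriched


def render_sample(name: str, labels: dict, value) -> str:
    """Render one sample line, with labels sorted by name for stable output."""
    body = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{body}}} {value}"


if __name__ == "__main__":
    # Hypothetical fake pod identity, used only for this example.
    labels = enrich_labels({"gpu": "0"}, "nvidia-dcgm-exporter-abc", "10.0.0.5")
    print(render_sample("DCGM_FI_DEV_GPU_UTIL", labels, 42))
```

This is also where `honorLabels: true` matters: by default Prometheus lets the scrape target's own `job`, `instance`, and similar labels win and renames the conflicting scraped ones with an `exported_` prefix, whereas with `honorLabels` the labels written by the service are kept as-is.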
Contributor

This section was too technical for me to read and digest without an in-person explanation.

- [ ] Ensure consistent label exportation.
- [ ] Add a ServiceMonitor for the new service, and set `honorLabels: true` on it (so we can fake multiple exporters).
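
As a rough illustration of the ServiceMonitor item, the sketch below builds a manifest as a Python dictionary and prints it as JSON. The object name, the `app` selector label, and the `metrics` port name are hypothetical placeholders; only `honorLabels: true` comes from the design text.

```python
import json


def status_exporter_service_monitor(
    name: str = "status-exporter",
    port: str = "metrics",
    match_labels=None,
) -> dict:
    """Build a ServiceMonitor manifest whose endpoint keeps scraped labels.

    The name, port name, and selector labels are illustrative assumptions.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": match_labels or {"app": name}},
            # honorLabels keeps the job/instance/pod labels set by the exporter
            # instead of letting the scrape target's labels replace them.
            "endpoints": [{"port": port, "honorLabels": True}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(status_exporter_service_monitor(), indent=2))
```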

## Limitations
Contributor

Only on fake nodes, right? AFAIK we do support MIG and nvidia-smi in the Fake GPU Operator overall.

Contributor Author

Yes, of course.

@gshaibi gshaibi closed this Apr 2, 2024