Merge branch 'master' into volcano-guru

volcano-sh · Oct 26, 2024 · a672f02
2 parents ff36344 + 2ea1469
commit a672f02
Showing 3 changed files with 111 additions and 34 deletions.
diff --git a/docs/design/metrics.md b/docs/design/metrics.md
@@ -1,39 +1,58 @@
 ## Scheduler Monitoring
 
 ## Introduction
-Currently users can leverage controller logs and job events to monitor scheduler. While useful for debugging, none of this options is particularly practical for monitoring kube-batch behaviour over time. There's also requirement like to monitor kube-batch in one view to resolve critical performance issue in time [#427](https://github.com/kubernetes-sigs/kube-batch/issues/427).
+Currently, users can leverage controller logs and job events to monitor the scheduler. While useful for debugging, none of these options is particularly practical for monitoring volcano's behaviour over time. There is also a requirement to monitor volcano in a single view so that critical performance issues can be resolved in time [#427](https://github.com/kubernetes-sigs/kube-batch/issues/427).
 
-This document describes metrics we want to add into kube-batch to better monitor performance.
+This document describes metrics we want to add into volcano to better monitor performance.
 
 ## Metrics
-In order to support metrics, kube-batch needs to expose a metrics endpoint which can provide golang process metrics like number of goroutines, gc duration, cpu and memory usage, etc as well as kube-batch custom metrics related to time taken by plugins or actions.
-
-All the metrics are prefixed with `kube_batch_`.
-
-### kube-batch execution
-This metrics track execution of plugins and actions of kube-batch loop.
-
-| Metric name | Metric type | Labels | Description |
-| ----------- | ----------- | ------ | ----------- |
-| e2e_scheduling_latency | histogram |  | E2e scheduling latency in seconds |
-| plugin_latency | histogram | `plugin`=&lt;plugin_name&gt; | Schedule latency for plugin |
-| action_latency | histogram | `action`=&lt;action_name&gt; | Schedule latency for action |
-| task_latency | histogram | `job`=&lt;job_id&gt; `task`=&lt;task_id&gt; | Schedule latency for each task |
-
-
-### kube-batch operations
-This metrics describe internal state of kube-batch.
-
-| Metric name | Metric type | Labels | Description |
-| ----------- | ----------- | ------ | ----------- |
-| pod_schedule_errors | Counter |  | The number of kube-batch failed due to an error |
-| pod_schedule_successes | Counter | | The number of kube-batch success in scheduling a job |
-| pod_preemption_victims | Counter | | Number of selected preemption victims |
-| total_preemption_attempts | Counter |  | Total preemption attempts in the cluster till now |
-| unschedule_task_count | Counter | `job`=&lt;job_id&gt; | The number of tasks failed to schedule |
-| unschedule_job_counts | Counter | | The number of job failed to schedule in each iteration |
-| job_retry_counts | Counter | `job`=&lt;job_id&gt; | The number of retry times of one job |
-
-
-### kube-batch Liveness
-Healthcheck last time of kube-batch activity and timeout
+To support metrics, volcano needs to expose a metrics endpoint that provides Go process metrics (such as the number of goroutines, GC duration, and CPU and memory usage) as well as volcano custom metrics for the time taken by plugins or actions.
+
+All the metrics are prefixed with `volcano_`.
+
+### volcano execution
+These metrics track the execution of plugins and actions in the volcano loop.
+
+| **Metric Name**                           | **Metric Type** | **Labels**                                                                                | **Description**                                                                |
+|-------------------------------------------|-----------------|-------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------|
+| `e2e_scheduling_latency_milliseconds`     | Histogram       | None                                                                                      | End-to-end scheduling latency in milliseconds (scheduling algorithm + binding) |
+| `e2e_job_scheduling_latency_milliseconds` | Histogram       | None                                                                                      | End-to-end job scheduling latency in milliseconds                              |
+| `e2e_job_scheduling_duration`             | Gauge           | `job_name`=&lt;job_name&gt;, `queue`=&lt;queue&gt;, `job_namespace`=&lt;job_namespace&gt; | End-to-end job scheduling duration                                             |
+| `e2e_job_scheduling_start_time`           | Gauge           | `job_name`=&lt;job_name&gt;, `queue`=&lt;queue&gt;, `job_namespace`=&lt;job_namespace&gt; | End-to-end job scheduling start time                                           |
+| `plugin_scheduling_latency_milliseconds`  | Histogram       | `plugin`=&lt;plugin_name&gt;, `OnSession`=&lt;OnSession&gt;                               | Plugin scheduling latency in milliseconds                                      |
+| `action_scheduling_latency_milliseconds`  | Histogram       | `action`=&lt;action_name&gt;                                                              | Action scheduling latency in milliseconds                                      |
+| `task_scheduling_latency_milliseconds`    | Histogram       | None                                                                                      | Task scheduling latency in milliseconds                                        |
+
+
+### volcano operations
+These metrics describe the internal state of volcano.
+
+| **Metric Name**                 | **Metric Type** | **Labels**                                                  | **Description**                               |
+|---------------------------------|-----------------|-------------------------------------------------------------|-----------------------------------------------|
+| `schedule_attempts_total`       | Counter         | `result`=&lt;result&gt;                                     | The number of attempts to schedule pods       |
+| `pod_preemption_victims`        | Gauge           | None                                                        | The number of selected preemption victims     |
+| `total_preemption_attempts`     | Counter         | None                                                        | Total preemption attempts in the cluster      |
+| `unschedule_task_count`         | Gauge           | `job_id`=&lt;job_id&gt;                                     | The number of tasks that failed to schedule   |
+| `unschedule_job_counts`         | Gauge           | None                                                        | The number of jobs that failed to schedule    |
+| `queue_allocated_milli_cpu`     | Gauge           | `queue_name`=&lt;queue_name&gt;                             | Allocated CPU count for one queue             |
+| `queue_allocated_memory_bytes`  | Gauge           | `queue_name`=&lt;queue_name&gt;                             | Allocated memory for one queue                |
+| `queue_request_milli_cpu`       | Gauge           | `queue_name`=&lt;queue_name&gt;                             | Requested CPU count for one queue             |
+| `queue_request_memory_bytes`    | Gauge           | `queue_name`=&lt;queue_name&gt;                             | Requested memory for one queue                |
+| `queue_deserved_milli_cpu`      | Gauge           | `queue_name`=&lt;queue_name&gt;                             | Deserved CPU count for one queue              |
+| `queue_deserved_memory_bytes`   | Gauge           | `queue_name`=&lt;queue_name&gt;                             | Deserved memory for one queue                 |
+| `queue_share`                   | Gauge           | `queue_name`=&lt;queue_name&gt;                             | Share for one queue                           |
+| `queue_weight`                  | Gauge           | `queue_name`=&lt;queue_name&gt;                             | Weight for one queue                          |
+| `queue_overused`                | Gauge           | `queue_name`=&lt;queue_name&gt;                             | Whether one queue is overused                 |
+| `queue_pod_group_inqueue_count` | Gauge           | `queue_name`=&lt;queue_name&gt;                             | The number of Inqueue PodGroups in this queue |
+| `queue_pod_group_pending_count` | Gauge           | `queue_name`=&lt;queue_name&gt;                             | The number of Pending PodGroups in this queue |
+| `queue_pod_group_running_count` | Gauge           | `queue_name`=&lt;queue_name&gt;                             | The number of Running PodGroups in this queue |
+| `queue_pod_group_unknown_count` | Gauge           | `queue_name`=&lt;queue_name&gt;                             | The number of Unknown PodGroups in this queue |
+| `namespace_share`               | Gauge           | `namespace_name`=&lt;namespace_name&gt;                     | Share for one namespace                       |
+| `namespace_weight`              | Gauge           | `namespace_name`=&lt;namespace_name&gt;                     | Weight for one namespace                      |
+| `job_share`                     | Gauge           | `job_id`=&lt;job_id&gt;, `job_ns`=&lt;job_ns&gt;            | Share for one job                             |
+| `job_retry_counts`              | Counter         | `job_id`=&lt;job_id&gt;                                     | The number of retries for one job             |
+| `job_completed_phase_count`     | Counter         | `job_name`=&lt;job_name&gt;, `queue_name`=&lt;queue_name&gt; | The number of jobs in the completed phase     |
+| `job_failed_phase_count`        | Counter         | `job_name`=&lt;job_name&gt;, `queue_name`=&lt;queue_name&gt; | The number of jobs in the failed phase        |
+
+### volcano Liveness
+A health check based on the last time of volcano activity and a timeout.
diff --git a/docs/design/podgroup-statistics.md b/docs/design/podgroup-statistics.md
@@ -0,0 +1,58 @@
+# PodGroup Statistics
+
+## Background
+
+Each time the state of a podgroup changes, the controller updates the per-state podgroup statistics in the queue's status.
+At the end of each scheduling session, the volcano scheduler also updates the allocated field in the queue's status to record
+the amount of resources allocated. Both components use the `UpdateStatus` API to update the queue status, which causes
+conflict errors. When the controller encounters such an error, it calls `AddRateLimited` to push the podgroup back into the work queue,
+and the accumulated retries result in a memory leak. See issue #3597: https://github.com/volcano-sh/volcano/issues/3597.
+
+## Alternative
+Currently the per-state podgroup statistics are only used for display by vcctl, so there is no need to persist them in the queue's status.
+When users run `vcctl queue get -n [name]` or `vcctl list` to display queues and the podgroup states in each queue,
+vcctl should calculate the podgroup statistics on the client side and then display them. We can also export these per-state podgroup statistics as metrics.
+
+## Implementation
+- In `syncQueue` of the queue controller, the counts of podgroups in each state will no longer be persisted in the queue status;
+instead, these statistics will be recorded as metrics and exported directly: https://github.com/volcano-sh/volcano/blob/c9be5c4c934597d99a0a80c9b26a3e919bbf8877/pkg/controllers/queue/queue_controller_action.go#L41-L61.
+`UpdateStatus` should not be used here: https://github.com/volcano-sh/volcano/blob/c9be5c4c934597d99a0a80c9b26a3e919bbf8877/pkg/controllers/queue/queue_controller_action.go#L84-L87,
+because the `UpdateStatus` interface verifies the resourceVersion in the apiserver, which may cause concurrent update conflicts.
+Instead, `ApplyStatus` should be used to avoid this situation, because we only need to update the status of the queue.
+Note that the controller currently does not have permission to patch the queue status, so we should add a patch queue/status permission to the clusterrole.
+- `vcctl queue get -n [name]` and `vcctl list` currently display the per-state podgroup statistics directly from the queue's status;
+instead, `vcctl` should take one more step: query the podgroups owned by the queue, count the podgroups in each state on the `vcctl` side, and then display them.
+- The metrics named `queue_pod_group_[state]_count`, which were recorded by the proportion/capacity plugins in the scheduler,
+will now be recorded by the controller instead. The metrics involved are as follows:
+     - `queue_pod_group_inqueue_count`: The number of Inqueue PodGroups in this queue
+     - `queue_pod_group_pending_count`: The number of Pending PodGroups in this queue
+     - `queue_pod_group_running_count`: The number of Running PodGroups in this queue
+     - `queue_pod_group_unknown_count`: The number of Unknown PodGroups in this queue
+     - A new metric, `queue_pod_group_completed_count`, is added to record the number of Completed PodGroups in the queue
+
+## Notice
+- The per-state podgroup statistics fields are still retained in the queue status, but vc-controller no longer updates them.
+**If a user-written plugin relies on these fields in the queue status, the logic of the plugin needs to be modified before
+upgrading to the latest version of volcano**. Users can refer to the following implementation when modifying their plugins; the full code
+can be found in pkg/cli/queue/get.go. `PodGroupStatistics` records the number of podgroups in each state, which is the same as the
+previous fields in the queue status. We need to list all podgroups first, select those that belong to the queue, and then record the statistics.
+*If the user is running k8s v1.30+, the `CustomResourceFieldSelectors` feature gate can be enabled to filter the podgroups contained
+in the queue directly in kube-apiserver without having to query all podgroups*.
+```go
+pgList, err := queueClient.SchedulingV1beta1().PodGroups("").List(ctx, metav1.ListOptions{})
+if err != nil {
+    ...
+}
+
+pgStats := &PodGroupStatistics{}
+for _, pg := range pgList.Items {
+    if pg.Spec.Queue == queue.Name {
+        pgStats.StatPodGroupCountsForQueue(&pg)
+    }
+}
+```
+- The queue controller in vc-controller still keeps a podgroup cache. The cache is still needed to check whether the queue can be closed,
+and to count the podgroups in each state and export the counts as metrics.
+- The metrics `queue_pod_group_inqueue_count`, `queue_pod_group_pending_count`, `queue_pod_group_running_count`, and `queue_pod_group_unknown_count`,
+which previously belonged to the scheduler, are now exported by vc-controller instead. If the user still needs to query these metrics,
+vc-controller needs to be added as a scrape target; the PromQL queries themselves do not need to be modified.
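As a rough illustration of the client-side counting that podgroup-statistics.md describes, the sketch below tallies podgroups per state for one queue. It uses only the standard library. The `PodGroup` struct, the phase constants, the `statsForQueue` helper, and the body of `StatPodGroupCountsForQueue` are simplified assumptions for illustration, not the actual code in pkg/cli/queue/get.go.

```go
package main

import "fmt"

// PodGroupPhase and PodGroup are hypothetical, trimmed-down stand-ins for
// the real scheduling types; only the fields needed for counting are kept.
type PodGroupPhase string

const (
	Pending   PodGroupPhase = "Pending"
	Running   PodGroupPhase = "Running"
	Unknown   PodGroupPhase = "Unknown"
	Inqueue   PodGroupPhase = "Inqueue"
	Completed PodGroupPhase = "Completed"
)

type PodGroup struct {
	Name  string
	Queue string
	Phase PodGroupPhase
}

// PodGroupStatistics holds one counter per podgroup state.
type PodGroupStatistics struct {
	Pending, Running, Unknown, Inqueue, Completed int
}

// StatPodGroupCountsForQueue increments the counter that matches the
// podgroup's phase (assumed behaviour; unrecognized phases are ignored).
func (s *PodGroupStatistics) StatPodGroupCountsForQueue(pg *PodGroup) {
	switch pg.Phase {
	case Pending:
		s.Pending++
	case Running:
		s.Running++
	case Unknown:
		s.Unknown++
	case Inqueue:
		s.Inqueue++
	case Completed:
		s.Completed++
	}
}

// statsForQueue scans all podgroups and counts only those in the given queue.
func statsForQueue(queue string, pgs []PodGroup) PodGroupStatistics {
	var stats PodGroupStatistics
	for i := range pgs {
		if pgs[i].Queue == queue {
			stats.StatPodGroupCountsForQueue(&pgs[i])
		}
	}
	return stats
}

func main() {
	pgs := []PodGroup{
		{Name: "pg-a", Queue: "default", Phase: Running},
		{Name: "pg-b", Queue: "default", Phase: Pending},
		{Name: "pg-c", Queue: "batch", Phase: Running},
	}
	fmt.Printf("%+v\n", statsForQueue("default", pgs))
}
```

Scanning every podgroup is linear in the cluster-wide podgroup count, which is why the doc suggests server-side filtering via `CustomResourceFieldSelectors` on k8s v1.30+ where available.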
diff --git a/pkg/webhooks/admission/jobs/validate/admit_job.go b/pkg/webhooks/admission/jobs/validate/admit_job.go
@@ -297,7 +297,7 @@ func validateJobUpdate(old, new *v1alpha1.Job) error {
 	}
 
 	if !apiequality.Semantic.DeepEqual(new.Spec, old.Spec) {
-		return fmt.Errorf("job updates may not change fields other than `minAvailable`, `tasks[*].replicas under spec`")
+		return fmt.Errorf("job updates may not change fields other than `minAvailable`, `tasks[*].replicas under spec` and `PriorityClassName`")
 	}
 
 	return nil
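One common way to enforce a "only these fields may change" rule like the one in the error message above is to copy the new spec, reset the mutable fields to their old values, and then compare the result with the old spec. The sketch below shows that pattern with the standard library only. `JobSpec`, `TaskSpec`, and `validateSpecUpdate` are hypothetical simplified types and names, `reflect.DeepEqual` stands in for `apiequality.Semantic.DeepEqual`, and rejecting a changed task count is an assumption of this sketch; none of it is taken from admit_job.go.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// TaskSpec and JobSpec are hypothetical, trimmed-down stand-ins for the
// real v1alpha1 types.
type TaskSpec struct {
	Name     string
	Replicas int32
	Image    string
}

type JobSpec struct {
	MinAvailable      int32
	PriorityClassName string
	Queue             string
	Tasks             []TaskSpec
}

var errImmutable = errors.New("job updates may not change fields other than `minAvailable`, `tasks[*].replicas under spec` and `PriorityClassName`")

// validateSpecUpdate allows changes only to MinAvailable, PriorityClassName,
// and each task's Replicas; any other difference is rejected.
func validateSpecUpdate(oldSpec, newSpec JobSpec) error {
	// Assumption of this sketch: tasks may not be added or removed.
	if len(newSpec.Tasks) != len(oldSpec.Tasks) {
		return errImmutable
	}

	// Copy the new spec and reset every mutable field to its old value.
	normalized := newSpec
	normalized.MinAvailable = oldSpec.MinAvailable
	normalized.PriorityClassName = oldSpec.PriorityClassName
	if len(newSpec.Tasks) > 0 {
		// Copy the slice so the caller's tasks are not modified.
		normalized.Tasks = append([]TaskSpec(nil), newSpec.Tasks...)
		for i := range normalized.Tasks {
			normalized.Tasks[i].Replicas = oldSpec.Tasks[i].Replicas
		}
	}

	// Whatever still differs must be an immutable field.
	if !reflect.DeepEqual(normalized, oldSpec) {
		return errImmutable
	}
	return nil
}

func main() {
	oldSpec := JobSpec{
		MinAvailable: 1,
		Queue:        "default",
		Tasks:        []TaskSpec{{Name: "worker", Replicas: 2, Image: "busybox"}},
	}

	allowed := oldSpec
	allowed.PriorityClassName = "high-priority"
	allowed.Tasks = []TaskSpec{{Name: "worker", Replicas: 4, Image: "busybox"}}
	fmt.Println("allowed update error:", validateSpecUpdate(oldSpec, allowed))

	rejected := oldSpec
	rejected.Queue = "other"
	fmt.Println("rejected update error:", validateSpecUpdate(oldSpec, rejected))
}
```

Unlike the semantic comparison used in the real webhook, `reflect.DeepEqual` distinguishes nil from empty slices, so this sketch is stricter at that edge.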