[Proposal] Add podgroup statistics doc #3750

JesseStutler · 2024-09-26T08:04:33Z

Backgrounds

Each time the state of a podgroup changes, the controller updates the per-state podgroup statistics in the queue's status. At the end of each scheduling session, the volcano scheduler also updates the allocated field in the queue's status to record the amount of resources allocated. Both components use the UpdateStatus API to update the queue status, which can cause conflict errors. When the controller encounters such an error, it calls AddRateLimited to push the podgroup back into the work queue. As these retries accumulate, they result in a memory leak. See issue #3597.
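To make the conflict concrete, here is a minimal sketch, not the actual Volcano code. Store, Queue, and QueueStatus are hypothetical in-memory stand-ins for the apiserver and the Queue object, and only the resourceVersion check is modelled. The point is that two writers that read the same version cannot both succeed with a full status update.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict stands in for the apiserver's conflict response (assumption).
var ErrConflict = errors.New("conflict: resourceVersion is stale")

// QueueStatus is a hypothetical, trimmed-down stand-in for the queue status.
type QueueStatus struct {
	Running   int // written by the controller
	Allocated int // written by the scheduler
}

// Queue is a hypothetical stand-in for the Queue object.
type Queue struct {
	ResourceVersion int
	Status          QueueStatus
}

// Store is a toy in-memory apiserver holding a single queue.
type Store struct{ queue Queue }

// Get returns a copy of the stored queue, like a read from the apiserver.
func (s *Store) Get() Queue { return s.queue }

// UpdateStatus replaces the whole status, but only if the caller's copy
// still carries the current resourceVersion (optimistic concurrency).
func (s *Store) UpdateStatus(q Queue) error {
	if q.ResourceVersion != s.queue.ResourceVersion {
		return ErrConflict
	}
	s.queue.Status = q.Status
	s.queue.ResourceVersion++
	return nil
}

func main() {
	store := &Store{}

	// Both components read the same version of the queue.
	fromController := store.Get()
	fromScheduler := store.Get()

	// The scheduler writes first, which bumps the resourceVersion.
	fromScheduler.Status.Allocated = 4
	fmt.Println("scheduler:", store.UpdateStatus(fromScheduler))

	// The controller's copy is now stale, so its write is rejected.
	fromController.Status.Running = 1
	fmt.Println("controller:", store.UpdateStatus(fromController))
}
```

In the real controller, the rejected write is what leads to the AddRateLimited requeue described above.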

Alternative

Currently the per-state podgroup statistics are only used for display by vcctl, so there is no need to persist them in the queue's status. When users run vcctl queue get -n [name] or vcctl list to display queues and the number of podgroups in each state, vcctl should calculate the podgroup statistics on the client side and then display them. We can also export these per-state podgroup statistics as metrics.
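The client-side calculation could look roughly like the sketch below. PodGroup here is a hypothetical, trimmed-down type rather than the real API object, CountByPhase is an illustrative helper name, and the phase strings are assumptions based on the pending/running/unknown/inqueue/completed states.

```go
package main

import "fmt"

// PodGroup is a hypothetical stand-in for the PodGroup object;
// only the fields needed for counting are kept.
type PodGroup struct {
	Name  string
	Queue string
	Phase string
}

// CountByPhase returns how many podgroups of the given queue are in each phase.
// Phases with no podgroups are simply absent from the map (they read as 0).
func CountByPhase(podGroups []PodGroup, queue string) map[string]int {
	counts := make(map[string]int)
	for _, pg := range podGroups {
		if pg.Queue != queue {
			continue // podgroup belongs to another queue
		}
		counts[pg.Phase]++
	}
	return counts
}

func main() {
	// In vcctl these would come from listing podgroups through the API.
	podGroups := []PodGroup{
		{Name: "pg-a", Queue: "default", Phase: "Running"},
		{Name: "pg-b", Queue: "default", Phase: "Pending"},
		{Name: "pg-c", Queue: "test", Phase: "Running"},
		{Name: "pg-d", Queue: "default", Phase: "Running"},
	}

	counts := CountByPhase(podGroups, "default")
	// Assumed phase names, printed in a fixed order for display.
	for _, phase := range []string{"Pending", "Running", "Unknown", "Inqueue", "Completed"} {
		fmt.Printf("%s: %d\n", phase, counts[phase])
	}
}
```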

Implementation

In syncQueue of the queue controller, we should no longer count the podgroups in each state; instead, these statistics should be recorded as metrics and exported: https://github.com/volcano-sh/volcano/blob/release-1.10/pkg/controllers/queue/queue_controller_action.go#L41-L61. UpdateStatus should not be used here either: https://github.com/volcano-sh/volcano/blob/release-1.10/pkg/controllers/queue/queue_controller_action.go#L84-L87. The UpdateStatus interface verifies the resourceVersion in the apiserver, which may cause concurrent update conflicts. Instead, ApplyStatus should be used to avoid this situation, because we only need to update the status of the queue. Note that the controller currently does not have permission to patch the queue status, so we should add a patch queue/status permission to the clusterrole.
vcctl queue get -n [name] and vcctl list currently display the per-state podgroup statistics directly from the queue's status. Instead, vcctl should take one more step: query the podgroups owned by the queue, count the podgroups in each state on the vcctl side, and then display them.
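As a rough illustration of why a patch-style status update avoids the conflict, here is a toy sketch. Store, StatusPatch, and this ApplyStatus method are hypothetical stand-ins and do not match the generated client's ApplyStatus signature. Real server-side apply also tracks field ownership, which is left out here. Each writer sends only the fields it cares about, and no resourceVersion is compared.

```go
package main

import "fmt"

// QueueStatus is a hypothetical, trimmed-down stand-in for the queue status.
type QueueStatus struct {
	State     string // written by the controller
	Allocated int    // written by the scheduler
}

// Queue is a hypothetical stand-in for the Queue object.
type Queue struct {
	ResourceVersion int
	Status          QueueStatus
}

// StatusPatch lists only the fields a writer wants to set;
// nil fields are left untouched in the stored object.
type StatusPatch struct {
	State     *string
	Allocated *int
}

// Store is a toy in-memory apiserver holding a single queue.
type Store struct{ queue Queue }

// Get returns a copy of the stored queue.
func (s *Store) Get() Queue { return s.queue }

// ApplyStatus merges the patch into the stored status. Unlike a full
// update, it does not compare resourceVersions, so a writer holding an
// old copy of the queue is not rejected.
func (s *Store) ApplyStatus(p StatusPatch) {
	if p.State != nil {
		s.queue.Status.State = *p.State
	}
	if p.Allocated != nil {
		s.queue.Status.Allocated = *p.Allocated
	}
	s.queue.ResourceVersion++
}

func main() {
	store := &Store{}
	state := "Open"
	allocated := 4

	// The controller sets only the state, the scheduler only the allocation.
	store.ApplyStatus(StatusPatch{State: &state})
	store.ApplyStatus(StatusPatch{Allocated: &allocated})

	fmt.Printf("%+v\n", store.Get().Status)
}
```

Because the two writers touch disjoint fields, neither overwrites the other, which is the property the controller relies on once the podgroup counts are no longer written to the status.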

JesseStutler · 2024-09-26T08:06:13Z

/cc @hwdef @Monokaix @lowang-bh Please take a look~ I will push the other related PR later.

Monokaix · 2024-09-27T01:38:29Z

We should also add that the queue's pg statistics are still maintained in the queue cache, because this data is needed when closing a queue.

hwdef

I think there is nothing wrong with this implementation logic

hwdef · 2024-09-30T06:34:10Z

docs/design/podgroup-statistics.md

+
+## Backgrounds
+
+Each time when podgroups states changed, the controller will update the statistics of podgroup of each state in the queue's status. And at the end of each scheduling session, the volcano scheduler will also update the allocated filed in queue's status to recored the amount of the amount of resources allocated. Both components use `UpdateStatus` api to update the queue status, which will cause conflict errors. When the controller encounter such an error, it will trigger `AddRateLimited` to push back the podgroup into work queue, resulting in accumulation of memory leak. See in issue #3597.


please add link for 3597

hwdef · 2024-09-30T06:35:33Z

docs/design/podgroup-statistics.md

+Each time when podgroups states changed, the controller will update the statistics of podgroup of each state in the queue's status. And at the end of each scheduling session, the volcano scheduler will also update the allocated filed in queue's status to recored the amount of the amount of resources allocated. Both components use `UpdateStatus` api to update the queue status, which will cause conflict errors. When the controller encounter such an error, it will trigger `AddRateLimited` to push back the podgroup into work queue, resulting in accumulation of memory leak. See in issue #3597.
+
+## Alternative
+Currently the statistics of podgroup of eatch state are only used for display by vcctl, there is no need to be persisted in queue's status. So when users need to use `vcctl queue get -n [name]` or `vcctl list` to display queues and each state of podgroups in queue, vcctl should calculate podgroup statistics in client side and then display them. And we can export these statistics of podgroups in each state as metrics.  


what is eatch mean?

each state means podgroups in pending/running/unknown/inqueue/completed status

I mean there is a typo

Oh sorry, didn't notice that there was a typo, now I have fixed it.

JesseStutler · 2024-10-24T02:46:21Z

We should also add that the queue's pg statistics are still maintained in the queue cache, because this data is needed when closing a queue.

I added a notice item to record this, please review it again.

hwdef

/ok-to-test
/lgtm

Monokaix · 2024-10-24T07:32:00Z

docs/design/podgroup-statistics.md

+Currently the statistics of podgroups of each state are only used for display by vcctl, there is no need to be persisted in queue's status. So when users need to use `vcctl queue get -n [name]` or `vcctl list` to display queues and each state of podgroups in queue, vcctl should calculate podgroup statistics in client side and then display them. And we can export these statistics of podgroups in each state as metrics.  
+
+## Implementation
+- In `syncQueue` of queue controller, we should not stat counts of podgroups in each state here, instead these statistics should be recorded as metrics and then be exported outside: https://github.com/volcano-sh/volcano/blob/release-1.10/pkg/controllers/queue/queue_controller_action.go#L41-L61. And `UpdateStatus` should not be used here: https://github.com/volcano-sh/volcano/blob/release-1.10/pkg/controllers/queue/queue_controller_action.go#L84-L87, the `UpdateStatus` interface will verify the resourceVersion in the apiserver, which may cause concurrent update conflicts. Instead, `ApplyStatus` should be used here to avoid this situation, because we only need to update the status of the queue. It should be noted that the controller currently does not have patch queue status permissions, so we should add a patch queue/status permission to the clusterrole.


Please paste a permanent code link.

Signed-off-by: jessestutler <[email protected]>

Monokaix · 2024-10-26T06:42:27Z

/lgtm
/approve

volcano-sh-bot · 2024-10-26T06:42:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Monokaix

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Monokaix]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

volcano-sh-bot added the retest-not-required-docs-only label Sep 26, 2024

volcano-sh-bot requested review from lowang-bh and yuanchen8911 September 26, 2024 08:04

volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 26, 2024

JesseStutler mentioned this pull request Sep 26, 2024

feature: Add podgroups statistics #3751

Merged

hwdef reviewed Sep 30, 2024

JesseStutler force-pushed the czc_dev_doc branch 2 times, most recently from 64a1556 to 8944bfd Compare October 24, 2024 02:43

JesseStutler force-pushed the czc_dev_doc branch from 8944bfd to 1fcfa91 Compare October 24, 2024 06:17

hwdef reviewed Oct 24, 2024

volcano-sh-bot assigned hwdef Oct 24, 2024

volcano-sh-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. lgtm Indicates that a PR is ready to be merged. labels Oct 24, 2024

Monokaix reviewed Oct 24, 2024

Add podgroup statistics proposal

655a3f4

Signed-off-by: jessestutler <[email protected]>

JesseStutler force-pushed the czc_dev_doc branch from 1fcfa91 to 655a3f4 Compare October 26, 2024 03:04

volcano-sh-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 26, 2024

volcano-sh-bot assigned Monokaix Oct 26, 2024

volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 26, 2024

volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 26, 2024

volcano-sh-bot merged commit 4a7c823 into volcano-sh:master Oct 26, 2024
18 checks passed

JesseStutler mentioned this pull request Nov 15, 2024

Request to be a member of Volcano community volcano-sh/community#62

Closed

Monokaix added this to the v2.0 milestone Dec 5, 2024

JesseStutler mentioned this pull request Dec 23, 2024

queue.status.Running do not increased when create vcjob and pod is running successed. #3911

Closed
