[Proposal] Add podgroup statistics doc #3750
/cc @hwdef @Monokaix @lowang-bh Please take a look~ I will push the other related PR later.
We should also note that the queue's podgroup statistics are still maintained in the queue cache, because this data is needed when closing a queue.
I think there is nothing wrong with this implementation logic
docs/design/podgroup-statistics.md
Outdated
## Backgrounds
Each time podgroup states change, the controller updates the per-state podgroup statistics in the queue's status. At the end of each scheduling session, the volcano scheduler also updates the allocated field in the queue's status to record the amount of resources allocated. Both components use the `UpdateStatus` API to update the queue status, which causes conflict errors. When the controller encounters such an error, it calls `AddRateLimited` to push the podgroup back into the work queue; these retries accumulate and lead to a memory leak. See issue #3597.
please add link for 3597
done
docs/design/podgroup-statistics.md
Outdated
## Alternative
Currently the statistics of podgroup of eatch state are only used for display by vcctl, there is no need to be persisted in queue's status. So when users need to use `vcctl queue get -n [name]` or `vcctl list` to display queues and each state of podgroups in queue, vcctl should calculate podgroup statistics in client side and then display them. And we can export these statistics of podgroups in each state as metrics.
What does "eatch" mean?
each state means podgroups in pending/running/unknown/inqueue/completed status
Oh sorry, didn't notice that there was a typo, now I have fixed it.
64a1556 to 8944bfd
I added a notice item to record this, please review it again.
8944bfd to 1fcfa91
/ok-to-test
/lgtm
docs/design/podgroup-statistics.md
Outdated
Currently the statistics of podgroups of each state are only used for display by vcctl, there is no need to be persisted in queue's status. So when users need to use `vcctl queue get -n [name]` or `vcctl list` to display queues and each state of podgroups in queue, vcctl should calculate podgroup statistics in client side and then display them. And we can export these statistics of podgroups in each state as metrics.

## Implementation
- In `syncQueue` of the queue controller, we should not count the podgroups in each state; instead, these statistics should be recorded as metrics and exported: https://github.com/volcano-sh/volcano/blob/release-1.10/pkg/controllers/queue/queue_controller_action.go#L41-L61. `UpdateStatus` should not be used here either: https://github.com/volcano-sh/volcano/blob/release-1.10/pkg/controllers/queue/queue_controller_action.go#L84-L87. The `UpdateStatus` interface verifies the resourceVersion in the apiserver, which may cause concurrent update conflicts. `ApplyStatus` should be used instead to avoid this situation, because we only need to update the status of the queue. Note that the controller currently does not have permission to patch queue status, so we should add a patch queue/status permission to the clusterrole.
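
To illustrate why a version-checked full update conflicts while a field-scoped patch does not, here is a minimal in-memory sketch. It is a toy model only, not Volcano or client-go code: `Store`, `Queue`, `QueueStatus`, and `PatchStatus` are hypothetical names, and `PatchStatus` merely approximates the effect of an apply/patch-style write that carries no resourceVersion.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict mimics the conflict error returned for a stale resourceVersion.
var ErrConflict = errors.New("conflict: object has been modified")

// QueueStatus is a hypothetical, simplified queue status.
type QueueStatus struct {
	Running   int    // podgroup count, written by the controller
	Allocated string // allocated resources, written by the scheduler
}

// Queue is a toy object that carries a resourceVersion.
type Queue struct {
	ResourceVersion int
	Status          QueueStatus
}

// Store is an in-memory stand-in for the API server (assumption).
type Store struct{ q Queue }

// Get returns a copy of the stored queue, like a client read.
func (s *Store) Get() Queue { return s.q }

// UpdateStatus replaces the whole status and rejects stale writers.
func (s *Store) UpdateStatus(q Queue) error {
	if q.ResourceVersion != s.q.ResourceVersion {
		return ErrConflict
	}
	s.q.Status = q.Status
	s.q.ResourceVersion++
	return nil
}

// PatchStatus changes only the fields the caller touches and performs
// no version check, so concurrent writers to other fields do not collide.
func (s *Store) PatchStatus(mutate func(*QueueStatus)) {
	mutate(&s.q.Status)
	s.q.ResourceVersion++
}

func main() {
	s := &Store{}
	controllerCopy := s.Get() // both components read the same version
	schedulerCopy := s.Get()

	schedulerCopy.Status.Allocated = "cpu=4"
	fmt.Println("scheduler update:", s.UpdateStatus(schedulerCopy))

	controllerCopy.Status.Running = 3
	fmt.Println("controller update:", s.UpdateStatus(controllerCopy))

	// The controller retries with a patch that only sets its own field.
	s.PatchStatus(func(st *QueueStatus) { st.Running = 3 })
	fmt.Printf("after patch: %+v\n", s.Get().Status)
}
```

In this model the second `UpdateStatus` call fails because its copy is stale, while the patch succeeds and leaves the scheduler's `Allocated` value untouched.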
Please paste a permanent code link.
done
Signed-off-by: jessestutler <[email protected]>
1fcfa91 to 655a3f4
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Monokaix The full list of commands accepted by this bot can be found here. The pull request process is described here
fix #3597
## Backgrounds
Each time podgroup states change, the controller updates the per-state podgroup statistics in the queue's status. At the end of each scheduling session, the volcano scheduler also updates the allocated field in the queue's status to record the amount of resources allocated. Both components use the `UpdateStatus` API to update the queue status, which causes conflict errors. When the controller encounters such an error, it calls `AddRateLimited` to push the podgroup back into the work queue; these retries accumulate and lead to a memory leak. See issue #3597.

## Alternative
Currently the per-state podgroup statistics are only used for display by vcctl, so there is no need to persist them in the queue's status. When users run `vcctl queue get -n [name]` or `vcctl list` to display queues and the podgroups of each state in a queue, vcctl should calculate the podgroup statistics on the client side and then display them. We can also export these per-state podgroup statistics as metrics.

## Implementation
- In `syncQueue` of the queue controller, we should not count the podgroups in each state; instead, these statistics should be recorded as metrics and exported: https://github.com/volcano-sh/volcano/blob/release-1.10/pkg/controllers/queue/queue_controller_action.go#L41-L61. `UpdateStatus` should not be used here either: https://github.com/volcano-sh/volcano/blob/release-1.10/pkg/controllers/queue/queue_controller_action.go#L84-L87. The `UpdateStatus` interface verifies the resourceVersion in the apiserver, which may cause concurrent update conflicts. `ApplyStatus` should be used instead to avoid this situation, because we only need to update the status of the queue. Note that the controller currently does not have permission to patch queue status, so we should add a patch queue/status permission to the clusterrole.
- `vcctl get -n [name]` and `vcctl list` currently display the per-state podgroup statistics directly from the queue's status. Instead, we should add one more step: query the podgroups owned by the queue, count the podgroups in each state on the `vcctl` side, and then display them.
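
The client-side counting step could look roughly like the sketch below. It is an illustration under stated assumptions, not the vcctl implementation: `PodGroup` is a hypothetical trimmed-down struct, `CountByPhase` is a made-up helper name, and the phase strings mirror the states mentioned in the review thread (pending, running, unknown, inqueue, completed). A real client would list podgroups through the API first; here they are supplied as an in-memory slice.

```go
package main

import (
	"fmt"
	"sort"
)

// PodGroup is a hypothetical view of a podgroup holding only the
// fields needed for counting.
type PodGroup struct {
	Name  string
	Queue string
	Phase string // e.g. "Pending", "Running", "Unknown", "Inqueue", "Completed"
}

// CountByPhase groups podgroups by queue and counts them per phase.
func CountByPhase(pgs []PodGroup) map[string]map[string]int {
	stats := make(map[string]map[string]int)
	for _, pg := range pgs {
		if stats[pg.Queue] == nil {
			stats[pg.Queue] = make(map[string]int)
		}
		stats[pg.Queue][pg.Phase]++
	}
	return stats
}

func main() {
	// Sample data standing in for the result of listing podgroups.
	pgs := []PodGroup{
		{Name: "job-a", Queue: "default", Phase: "Running"},
		{Name: "job-b", Queue: "default", Phase: "Pending"},
		{Name: "job-c", Queue: "default", Phase: "Running"},
		{Name: "job-d", Queue: "batch", Phase: "Inqueue"},
	}
	stats := CountByPhase(pgs)

	// Sort queue names so the output order is stable.
	queues := make([]string, 0, len(stats))
	for q := range stats {
		queues = append(queues, q)
	}
	sort.Strings(queues)

	for _, q := range queues {
		c := stats[q]
		fmt.Printf("%s: pending=%d running=%d unknown=%d inqueue=%d completed=%d\n",
			q, c["Pending"], c["Running"], c["Unknown"], c["Inqueue"], c["Completed"])
	}
}
```

Because the counts are derived on demand from the podgroups themselves, nothing has to be written back to the queue's status, which is the point of the proposal.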