Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use replicas when initializing pg minResources #4000

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

sceneryback
Copy link
Contributor

When creating pod using replicaSet, the minMember of podgroup created is always 1, this is not reasonable. Consider this case:

  1. we created a deployment with 2 replicas, totally requesting 16 gpus
  2. the podgroup created has 1 minMember and 8 gpus
  3. the capability of queue has 8 gpus and the scheduler configmap disables AllocatableFn of proportion plugin
  4. the podgroup will be enqueued and running since minResources(8 gpus) is not bigger than queue capability
  5. finally the queue allocated 16 gpus, more than its deserved

As can be seen from this picture:
image

In the case that pod has replicas from its owners, the podgroup should consider relicas, just as training-operator does, which creates pod group based on resources from all replicas

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 7, 2025
@sceneryback
Copy link
Contributor Author

/assign @hzxuzhonghu

@hwdef
Copy link
Member

hwdef commented Feb 7, 2025

Same situation as mentioned in this issue:
#3970

@Monokaix
Copy link
Member

Monokaix commented Feb 7, 2025

proportion plugin controls the queue related capabilities, if you disable it, the capacity check will not take effect.

@Monokaix
Copy link
Member

Monokaix commented Feb 7, 2025

Just as @hwdef said, please set an annotation to control the minmember to set, and also other workload type like statefulset should also has this capability: )

@sceneryback
Copy link
Contributor Author

hi @hwdef @Monokaix thanks for all your suggestions, I have updated the feature using new annotation, so it won't break current behaviors. Please have a look.
apis change: volcano-sh/apis#154
I will update go.mod to delete local references of apis package after volcano-sh/apis#154 is merged


for _, reference := range pod.OwnerReferences {
if reference.Kind != "" && reference.Name != "" {
tmp := pg.getAnnotationsFromUpperRes(reference.Kind, reference.Name, pod.Namespace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we write annotation on the deployment to specify minMember, but here the ownerReferences of the Pod can only be traced back to the replicaset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From k8s source code https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/deployment/util/deployment_util.go#L235, the ReplicaSet will inherit annotations from Deployment, so I think it's enough to get annotations from RS
image

for _, reference := range pod.OwnerReferences {
if reference.Kind != "" && reference.Name != "" {
tmp := pg.getAnnotationsFromUpperRes(reference.Kind, reference.Name, pod.Namespace)
if err := mergo.Merge(&annotations, &tmp); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have scenarios that Pod has multiple OwnerReferences? If there are trully multiple annotations, we won't specify twice the minMember annotations, do we need to use the merge.Merge?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally there won't be multiple OwnerReferences with this annotation, updated. Thanks

@@ -177,6 +181,37 @@ func (pg *pgcontroller) getAnnotationsFromUpperRes(kind string, name string, nam
}
}

func (pg *pgcontroller) getMinMemberFromUpperRes(pod *v1.Pod) *int32 {
defaultMinMember := int32(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable name is very strange. You can set 1 as a constant (as defaultMinMember), and then the variable here is initialized to defaultMinMember. minMember here is ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -193,7 +228,7 @@ func (pg *pgcontroller) inheritUpperAnnotations(pod *v1.Pod, obj *scheduling.Pod
}
}

func (pg *pgcontroller) createNormalPodPGIfNotExist(pod *v1.Pod) error {
func (pg *pgcontroller) createNormalPodPGIfNotExist(pod *v1.Pod, minAvailable *int32) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the variable names here not unified as minMember?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -203,6 +238,11 @@ func (pg *pgcontroller) createNormalPodPGIfNotExist(pod *v1.Pod) error {
return err
}

minMember := int32(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to initialize again here, you already have the initialization value in getMinMemberFromUpperRes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -327,10 +327,10 @@ func (p TasksPriority) CalcFirstCountResources(count int32) v1.ResourceList {

for _, task := range p {
if count <= task.Replicas {
minReq = quotav1.Add(minReq, calTaskRequests(&v1.Pod{Spec: task.Template.Spec}, count))
minReq = quotav1.Add(minReq, *util.CalTaskRequests(&v1.Pod{Spec: task.Template.Spec}, count))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return value of the function here is a pointer, and then you use * directly to get the value pointed to by the pointer, the way it's written looks a bit strange and doesn't quite fit in with go's clean code. The return value of the func CalTaskRequests does not need to be refactored, keeping the original is better. In fact, when the API was designed, minResources should not be a pointer, ResourceList is a map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simply copied from an existing util function:
image
I will update GetPodQuotaUsage too
Also notice that the minResources field of PodGroup is designed as pointer originally, so currently won't break this

@sceneryback
Copy link
Contributor Author

hi @JesseStutler Thanks for your reviewing! Have updated, please have a look

@Monokaix
Copy link
Member

API related pr is merged, you can move on: )

@sceneryback sceneryback force-pushed the pgcontroller_create_use_replicas branch from 4241911 to 05dd3e0 Compare February 12, 2025 01:00
@sceneryback
Copy link
Contributor Author

Updated go.mod, please have a look @Monokaix @hwdef @JesseStutler

Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 12, 2025
@hwdef
Copy link
Member

hwdef commented Feb 12, 2025

/ok-to-test

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Feb 12, 2025
"k8s.io/kubernetes/pkg/apis/core/v1/helper"
quotacore "k8s.io/kubernetes/pkg/quota/v1/evaluator/core"
"k8s.io/utils/clock"
)

func GetPodQuotaUsage(pod *v1.Pod) *v1.ResourceList {
func GetPodQuotaUsage(pod *v1.Pod) v1.ResourceList {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why changed to non-poiner?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just follow this comment from @JesseStutler for better style #4000 (comment)

@@ -172,7 +172,8 @@ func (pg *pgcontroller) processNextReq() bool {

// normal pod use volcano
klog.V(4).Infof("Try to create podgroup for pod %s/%s", pod.Namespace, pod.Name)
if err := pg.createNormalPodPGIfNotExist(pod); err != nil {
minMember := pg.getMinMemberFromUpperRes(pod)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic should put in createNormalPodPGIfNotExist, because when the podgroup is already existed, there is no need to calculate the min member, it will request APIServer which is a heavy operation.
And there is also a qustion mentioned here #3970 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very good point 👍 so it's better to keep rs/sts etc informers in pgcontroller and read cache from these informers, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, DaemonSet and Job should also be considered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this should also be considered #3970 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this should also be considered #3970 (comment)

I think it's not necessary to check pod annotation? The annotation is designed for workload, it will be strange to see this in pod

@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 12, 2025
@volcano-sh-bot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign hwdef
You can assign the PR to them by writing /assign @hwdef in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sceneryback sceneryback force-pushed the pgcontroller_create_use_replicas branch from 2495285 to 27e95ed Compare February 12, 2025 12:09
@volcano-sh-bot volcano-sh-bot removed the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 12, 2025
@volcano-sh-bot volcano-sh-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 12, 2025
@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 12, 2025
@sceneryback
Copy link
Contributor Author

Refactored code using informers, please have a look again @hwdef @Monokaix @JesseStutler

@hwdef
Copy link
Member

hwdef commented Feb 14, 2025

Hi, please make the CI happy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants