Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TPU Provisioner: Node pool hash comparison #967

Conversation

nstogner
Copy link
Collaborator

@nstogner nstogner commented Feb 6, 2025

Issue:

Today the provisioner just checks if a node pool with the expected name exists for a given workload. If the workload is quickly recreated with a different node selector it is possible that the provisioner will not delete the old node pool in time and it will not think it needs to create a new one since the old one exists.

Fix:

This PR introduces new logic into the provisioner to check that the existing node pool matches the desired node pool via a hash comparison. Hash is calculated at node pool creation time and stored as a node pool label for later comparison.

Secondary issue & fix:

The reconciler responsible for node pool deletion used to watch for changes to Pods (such as being terminated off of a Node) in addition to changes to Nodes (ex: an update to the Node's status by kubelet) in order to know when to check whether a NodePool was free to delete. A historical PR removed the Pod watch causing a regression. As a result, changes to Node objects are now relied on - this can be slow as Node updates are infrequent (a Pod terminating does not result in an update to the Node object).

This secondary issue was fixed by re-adding the Pod watch into the deletion reconciler.

@nstogner nstogner force-pushed the tpu-provisioner-add-nodepool-match-check branch from 06681b9 to ddff710 Compare February 21, 2025 21:55
@nstogner nstogner requested a review from echiugoog as a code owner February 21, 2025 21:55
@nstogner nstogner force-pushed the tpu-provisioner-add-nodepool-match-check branch from ddff710 to 64973ac Compare February 21, 2025 21:58
@nstogner
Copy link
Collaborator Author

Updated to include an interface for interacting with GKE NodePool API in order to add some tests.

Copy link
Member

@echiugoog echiugoog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@roberthbailey
Copy link
Collaborator

@nstogner - Is this PR still a WIP or should the title be updated?

@nstogner nstogner changed the title WIP: TPU Provisioner: Node pool hash comparison TPU Provisioner: Node pool hash comparison Feb 26, 2025
@nstogner
Copy link
Collaborator Author

@nstogner - Is this PR still a WIP or should the title be updated?

Updated title

Copy link

@Ethanlm Ethanlm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you pinpoint where the second issue is fixed in the code?
I am not sure if I understand the second issue :(

@nstogner
Copy link
Collaborator Author

Could you pinpoint where the second issue is fixed in the code? I am not sure if I understand the second issue :(

https://github.com/GoogleCloudPlatform/ai-on-gke/pull/967/files#r1973622851

@nstogner nstogner requested a review from Ethanlm March 4, 2025 19:00
@nstogner nstogner force-pushed the tpu-provisioner-add-nodepool-match-check branch from 7511f6d to 1e6c285 Compare March 5, 2025 18:32
@nstogner nstogner merged commit 393dd27 into GoogleCloudPlatform:main Mar 5, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants