TPU Provisioner: Node pool hash comparison #967
Conversation
force-pushed from 06681b9 to ddff710
force-pushed from ddff710 to 64973ac
Updated to include an interface for interacting with the GKE NodePool API in order to add some tests.
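For future readers, a rough sketch of what such an interface and an in-memory fake could look like (the names and method set here are assumptions for illustration, not the PR's actual code):

```go
package provisioner

import (
	"context"
	"fmt"

	containerv1 "google.golang.org/api/container/v1"
)

// NodePools abstracts the subset of the GKE NodePool API that the provisioner
// needs, so tests can substitute an in-memory fake for real GKE calls.
type NodePools interface {
	Get(ctx context.Context, name string) (*containerv1.NodePool, error)
	Create(ctx context.Context, np *containerv1.NodePool) error
	Delete(ctx context.Context, name string) error
}

// fakeNodePools is an in-memory test double implementing NodePools.
type fakeNodePools struct {
	pools map[string]*containerv1.NodePool
}

func newFakeNodePools() *fakeNodePools {
	return &fakeNodePools{pools: map[string]*containerv1.NodePool{}}
}

func (f *fakeNodePools) Get(_ context.Context, name string) (*containerv1.NodePool, error) {
	np, ok := f.pools[name]
	if !ok {
		return nil, fmt.Errorf("node pool %q not found", name)
	}
	return np, nil
}

func (f *fakeNodePools) Create(_ context.Context, np *containerv1.NodePool) error {
	f.pools[np.Name] = np
	return nil
}

func (f *fakeNodePools) Delete(_ context.Context, name string) error {
	delete(f.pools, name)
	return nil
}
```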
LGTM
@nstogner - Is this PR still a WIP or should the title be updated?
Updated title.
Could you pinpoint where the second issue is fixed in the code?
I am not sure if I understand the second issue :(
https://github.com/GoogleCloudPlatform/ai-on-gke/pull/967/files#r1973622851
force-pushed from 7511f6d to 1e6c285
Issue:
Today the provisioner only checks whether a node pool with the expected name exists for a given workload. If the workload is quickly recreated with a different node selector, the provisioner may not delete the old node pool in time, and it will not create a new one because a node pool with the expected name already exists.
Fix:
This PR adds logic to the provisioner to verify that the existing node pool matches the desired node pool via a hash comparison. The hash is calculated at node pool creation time and stored as a node pool label for later comparison.
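A minimal sketch of the hash approach, assuming a hypothetical `NodePoolSpec` type and label key (the PR's actual field names and label key may differ):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// NodePoolSpec stands in for whatever fields determine the desired node pool
// (machine type, node selector, node count, ...).
type NodePoolSpec struct {
	MachineType  string            `json:"machineType"`
	NodeSelector map[string]string `json:"nodeSelector"`
	NodeCount    int               `json:"nodeCount"`
}

// hashLabelKey is a hypothetical label key used to store the hash on the
// node pool at creation time.
const hashLabelKey = "tpu-provisioner-spec-hash"

// specHash deterministically hashes the desired spec (encoding/json sorts map
// keys, so the output is stable for equal specs).
func specHash(spec NodePoolSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:8]), nil
}

// upToDate reports whether an existing node pool, identified by its labels,
// still matches the desired spec; if not, the provisioner should recreate it.
func upToDate(existingLabels map[string]string, desired NodePoolSpec) (bool, error) {
	want, err := specHash(desired)
	if err != nil {
		return false, err
	}
	return existingLabels[hashLabelKey] == want, nil
}

func main() {
	desired := NodePoolSpec{
		MachineType:  "ct5lp-hightpu-4t",
		NodeSelector: map[string]string{"app": "trainer"},
		NodeCount:    4,
	}
	h, _ := specHash(desired)
	fmt.Println("label stored at creation:", hashLabelKey+"="+h)

	ok, _ := upToDate(map[string]string{hashLabelKey: h}, desired)
	fmt.Println("existing node pool matches desired spec:", ok)
}
```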
Secondary issue & fix:
The reconciler responsible for node pool deletion used to watch for changes to Pods (such as being terminated off of a Node) in addition to changes to Nodes (ex: an update to the Node's status by kubelet) in order to know when to check whether a NodePool was free to delete. A historical PR removed the Pod watch causing a regression. As a result, changes to Node objects are now relied on - this can be slow as Node updates are infrequent (a Pod terminating does not result in an update to the Node object).
This secondary issue was fixed by re-adding the Pod watch into the deletion reconciler.
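For illustration, a minimal controller-runtime sketch of re-adding a Pod watch that maps Pod events back to the Pod's Node (signatures assume controller-runtime v0.15+; the reconciler and mapping logic are simplified and not the PR's actual code):

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// DeletionReconciler decides when a node pool's Nodes are drained and the
// node pool can be deleted (details elided).
type DeletionReconciler struct {
	client.Client
}

func (r *DeletionReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Check whether the node pool backing req.Name is free to delete (elided).
	return ctrl.Result{}, nil
}

func (r *DeletionReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Primary watch: Node updates (e.g. status changes from the kubelet).
		For(&corev1.Node{}).
		// Pod watch: map Pod events back to the Node the Pod ran on, so a Pod
		// terminating enqueues a reconcile immediately instead of waiting for
		// an infrequent Node update.
		Watches(&corev1.Pod{}, handler.EnqueueRequestsFromMapFunc(
			func(ctx context.Context, obj client.Object) []reconcile.Request {
				pod, ok := obj.(*corev1.Pod)
				if !ok || pod.Spec.NodeName == "" {
					return nil
				}
				return []reconcile.Request{
					{NamespacedName: types.NamespacedName{Name: pod.Spec.NodeName}},
				}
			},
		)).
		Complete(r)
}
```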