TPU Provisioner: Node pool hash comparison #967
Conversation
force-pushed from 06681b9 to ddff710
force-pushed from ddff710 to 64973ac
Updated to include an interface for interacting with the GKE NodePool API in order to add some tests.
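For future readers, a rough sketch of what such an interface and an in-memory fake could look like (the names and method set here are assumptions for illustration, not the PR's actual code):

```go
package provisioner

import (
	"context"
	"fmt"

	containerv1 "google.golang.org/api/container/v1"
)

// NodePools abstracts the subset of the GKE NodePool API that the provisioner
// needs, so tests can substitute an in-memory fake for real GKE calls.
type NodePools interface {
	Get(ctx context.Context, name string) (*containerv1.NodePool, error)
	Create(ctx context.Context, np *containerv1.NodePool) error
	Delete(ctx context.Context, name string) error
}

// fakeNodePools is an in-memory test double implementing NodePools.
type fakeNodePools struct {
	pools map[string]*containerv1.NodePool
}

func newFakeNodePools() *fakeNodePools {
	return &fakeNodePools{pools: map[string]*containerv1.NodePool{}}
}

func (f *fakeNodePools) Get(_ context.Context, name string) (*containerv1.NodePool, error) {
	np, ok := f.pools[name]
	if !ok {
		return nil, fmt.Errorf("node pool %q not found", name)
	}
	return np, nil
}

func (f *fakeNodePools) Create(_ context.Context, np *containerv1.NodePool) error {
	f.pools[np.Name] = np
	return nil
}

func (f *fakeNodePools) Delete(_ context.Context, name string) error {
	delete(f.pools, name)
	return nil
}
```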
LGTM
@nstogner - Is this PR still a WIP or should the title be updated?
Updated title.
Could you pinpoint where the second issue is fixed in the code?
I am not sure if I understand the second issue :(
https://github.com/GoogleCloudPlatform/ai-on-gke/pull/967/files#r1973622851
force-pushed from 7511f6d to 1e6c285
Issue:
Today the provisioner only checks whether a node pool with the expected name exists for a given workload. If the workload is quickly recreated with a different node selector, the provisioner may not delete the old node pool in time, and it will not create a new one because a node pool with the expected name already exists.
Fix:
This PR adds logic to the provisioner to verify that the existing node pool matches the desired node pool via a hash comparison. The hash is calculated at node pool creation time and stored as a node pool label for later comparison.
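A minimal sketch of the hash approach, assuming a hypothetical `NodePoolSpec` type and label key (the PR's actual field names and label key may differ):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// NodePoolSpec stands in for whatever fields determine the desired node pool
// (machine type, node selector, node count, ...).
type NodePoolSpec struct {
	MachineType  string            `json:"machineType"`
	NodeSelector map[string]string `json:"nodeSelector"`
	NodeCount    int               `json:"nodeCount"`
}

// hashLabelKey is a hypothetical label key used to store the hash on the
// node pool at creation time.
const hashLabelKey = "tpu-provisioner-spec-hash"

// specHash deterministically hashes the desired spec (encoding/json sorts map
// keys, so the output is stable for equal specs).
func specHash(spec NodePoolSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:8]), nil
}

// upToDate reports whether an existing node pool, identified by its labels,
// still matches the desired spec; if not, the provisioner should recreate it.
func upToDate(existingLabels map[string]string, desired NodePoolSpec) (bool, error) {
	want, err := specHash(desired)
	if err != nil {
		return false, err
	}
	return existingLabels[hashLabelKey] == want, nil
}

func main() {
	desired := NodePoolSpec{
		MachineType:  "ct5lp-hightpu-4t",
		NodeSelector: map[string]string{"app": "trainer"},
		NodeCount:    4,
	}
	h, _ := specHash(desired)
	fmt.Println("label stored at creation:", hashLabelKey+"="+h)

	ok, _ := upToDate(map[string]string{hashLabelKey: h}, desired)
	fmt.Println("existing node pool matches desired spec:", ok)
}
```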
Secondary issue & fix:
The reconciler responsible for node pool deletion used to watch for changes to Pods (such as being terminated off of a Node) in addition to changes to Nodes (ex: an update to the Node's status by kubelet) in order to know when to check whether a NodePool was free to delete. A historical PR removed the Pod watch causing a regression. As a result, changes to Node objects are now relied on - this can be slow as Node updates are infrequent (a Pod terminating does not result in an update to the Node object).
This secondary issue was fixed by re-adding the Pod watch into the deletion reconciler.
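For illustration, a minimal controller-runtime sketch of re-adding a Pod watch that maps Pod events back to the Pod's Node (signatures assume controller-runtime v0.15+; the reconciler and mapping logic are simplified and not the PR's actual code):

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// DeletionReconciler decides when a node pool's Nodes are drained and the
// node pool can be deleted (details elided).
type DeletionReconciler struct {
	client.Client
}

func (r *DeletionReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Check whether the node pool backing req.Name is free to delete (elided).
	return ctrl.Result{}, nil
}

func (r *DeletionReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Primary watch: Node updates (e.g. status changes from the kubelet).
		For(&corev1.Node{}).
		// Pod watch: map Pod events back to the Node the Pod ran on, so a Pod
		// terminating enqueues a reconcile immediately instead of waiting for
		// an infrequent Node update.
		Watches(&corev1.Pod{}, handler.EnqueueRequestsFromMapFunc(
			func(ctx context.Context, obj client.Object) []reconcile.Request {
				pod, ok := obj.(*corev1.Pod)
				if !ok || pod.Spec.NodeName == "" {
					return nil
				}
				return []reconcile.Request{
					{NamespacedName: types.NamespacedName{Name: pod.Spec.NodeName}},
				}
			},
		)).
		Complete(r)
}
```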