[Question] How to recover from provision state failed at Node Pool #4674

jkroepke · 2024-11-27T16:13:55Z

Describe scenario
We are running multiple AKS clusters and have weekly automatic patching enabled.

Some of our critical pods have a PDB configured.

Under certain conditions, the automatic update fails with a timeout because a configured PDB prevents a node from being drained.
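For illustration, a PDB that blocks a drain can usually be spotted by listing the PDBs that currently allow zero disruptions. This is only a minimal sketch, assuming `kubectl` is already pointed at the affected cluster; the function name `blocking_pdbs` is made up for this example.

```shell
# List PDBs that currently allow zero voluntary disruptions.
# Evictions covered by such a PDB are rejected, so a node drain
# cannot complete until the protected pods are healthy again.
blocking_pdbs() {
  kubectl get pdb --all-namespaces --no-headers \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
    | awk '$3 == 0 { print $1 "/" $2 }'
}
```

Calling `blocking_pdbs` prints one `namespace/name` entry per PDB that would currently deny an eviction.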

Question

To recover from that situation, I have to manually restart the pod. That's fine.

However, the node pool remains in the Failed state, and the extra nodes remain as well.

How can I recover from that state? How can I re-trigger the automatic update?

One solution is to manually delete the old VMs from the VMSS, but that's tricky on node pools with a large number of nodes.

JoeyC-Dev · 2024-12-02T03:39:36Z

az aks update -n $aks -g $rG

No other arguments/parameters.
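As a rough sketch of how this could be scripted, the snippet below checks the node pool's provisioning state first and only runs the plain `az aks update` when the pool reports `Failed`. The function names and the positional arguments (resource group, cluster name, node pool name) are assumptions made for this example, and depending on the CLI version the update may ask for confirmation before it reconciles.

```shell
# Print the provisioning state of one node pool (e.g. Succeeded, Failed).
# Arguments (hypothetical order): resource group, cluster name, node pool name.
nodepool_state() {
  az aks nodepool show \
    --resource-group "$1" \
    --cluster-name "$2" \
    --name "$3" \
    --query provisioningState \
    --output tsv
}

# Run "az aks update" without extra arguments only when the pool is Failed.
reconcile_if_failed() {
  state="$(nodepool_state "$1" "$2" "$3")"
  if [ "$state" = "Failed" ]; then
    echo "Node pool $3 is Failed; reconciling cluster $2"
    az aks update --resource-group "$1" --name "$2"
  else
    echo "Node pool $3 is $state; nothing to do"
  fi
}
```

For example, `reconcile_if_failed "$rG" "$aks" nodepool1` would apply the command above only to a cluster whose `nodepool1` pool is stuck (`nodepool1` is a placeholder name).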

jkroepke · 2024-12-02T07:31:39Z

So there is no Portal experience for this, right? I can't trigger an update in the Portal if the cluster is already on the latest version.

JoeyC-Dev · 2024-12-02T08:00:10Z

> So there is no Portal experience for this, right? I can't trigger an update in the Portal if the cluster is already on the latest version.

According to the documentation, yes. Maybe someone else knows whether it is hidden somewhere.

jkroepke · 2024-12-02T09:02:57Z

Thanks, I will try that on the next incident!

jkroepke added the question label Nov 27, 2024
