Fix: Enforce consistent Maintenance Mode Strategy behavior #7160

Open: wants to merge 5 commits into master
Conversation

@m-ildefons (Contributor) commented Dec 13, 2024

Problem:

As described in #6835, there are several problems to be considered here:

  1. Ensuring that the maintenance-mode strategy is specified
  2. Ensuring that only a valid maintenance-mode strategy can be specified
  3. Ensuring that the specified maintenance-mode strategy label is propagated all the way to all resources where it's expected
  4. Ensuring that default behavior is applied in all cases where a maintenance-mode strategy either isn't specified or isn't valid

Solution:

The solution consists of a modification to the node drain helper and modifications to both the mutating admission webhook and the validating admission webhook for the VirtualMachine resource.

The Node Drain Helper

The node drain helper previously only marked Pods of VMs for deletion (i.e. migration off the node) when the maintenance-mode strategy label was set to Migrate.
This omitted all Pods that didn't have the label at all, as well as those with an invalid value. Similarly, Pods of VMs without one of the three other maintenance-mode strategies were not shut down. As a result, any VM without a maintenance-mode strategy, or with an invalid one, was left running, preventing the node from entering maintenance mode successfully. The node would remain stuck in the "Cordoned" state, waiting for an operator to intervene and deal with the VMs.
The drain helper now migrates a VM off the node if the associated Pod has no maintenance-mode strategy label, has an invalid value in the label, or has the strategy explicitly set to Migrate. This is the documented default behavior for VMs.
All other VMs have one of the remaining valid maintenance-mode strategies specified and are shut down (and later restarted according to their maintenance-mode strategy).

Admission Webhook

The admission webhooks for the VirtualMachine resource play a crucial role in making sure that VirtualMachines have a sane maintenance-mode strategy specified.
First, the mutating webhook ensures that the maintenance-mode strategy label is set on the VirtualMachine resource. If it isn't, it is set to the default, Migrate.
The mutating webhook then propagates the label into .spec.template.metadata.labels, which ensures that it reaches both the VirtualMachineInstance and the Pod belonging to that VirtualMachine.
The mutating webhook does not modify existing maintenance-mode strategy labels on existing VirtualMachine resources. While the node drain helper applies the default strategy to these virtual machines, leaving the label untouched ensures that users are not faced with inexplicably changing maintenance-mode strategy labels if an invalid value already exists in their cluster.
The webhook does, however, log an appropriate message, allowing the user to identify VirtualMachine resources affected by invalid values in the maintenance-mode strategy label and correct them.

The validating webhook acts as a safeguard, ensuring that no VirtualMachine can be created or updated with an ill-specified maintenance-mode strategy.
It rejects the creation of VirtualMachine resources that lack the maintenance-mode strategy label or carry an invalid value in it.
This ensures that the label cannot accidentally be filled with bogus data and that the user is informed.

Related Issue: #6835

Test plan:

Scenario 1

Ensure that new VirtualMachines have a valid maintenance-mode strategy. If it's not given, the default should be applied.

Test 1

No maintenance-mode strategy given.

Setup:

Create a new VirtualMachine and specify no maintenance-mode strategy:

type: kubevirt.io.virtualmachine
metadata:
  namespace: default
  labels:
    harvesterhci.io/creator: harvester
    harvesterhci.io/os: linux
  name: test
[...]

Expected Result

The default strategy Migrate should have been added by the mutating admission webhook:

type: kubevirt.io.virtualmachine
metadata:
  namespace: default
  labels:
    harvesterhci.io/maintain-mode-strategy: Migrate
    [...]
  name: test
[...]

Test 2

Valid maintenance-mode strategy given.

Setup:

Create a new VirtualMachine and specify a valid maintenance-mode strategy:

type: kubevirt.io.virtualmachine
metadata:
  namespace: default
  labels:
    harvesterhci.io/maintain-mode-strategy: Shutdown
    [...]
  name: test
[...]

Expected Result

The previously specified maintenance-mode strategy should be persisted in the VirtualMachine resource's metadata:

type: kubevirt.io.virtualmachine
metadata:
  namespace: default
  labels:
    harvesterhci.io/maintain-mode-strategy: Shutdown
    [...]
  name: test
[...]

Scenario 2

If an invalid maintenance-mode strategy is specified, the VirtualMachine resource should be rejected.

Test 1

Invalid maintenance-mode strategy given.

Setup:

Create a new VirtualMachine and specify an invalid maintenance-mode strategy:

type: kubevirt.io.virtualmachine
metadata:
  namespace: default
  labels:
    harvesterhci.io/maintain-mode-strategy: foobar
    [...]
  name: test
[...]

Expected Result

The creation should be rejected with a sensible error message:

The request is invalid: metadata.labels[harvesterhci.io/maintain-mode-strategy]: invalid maintenance mode strategy: foobar

Scenario 3

The specified maintenance-mode strategy should be propagated to all relevant resources.

Test 1

The maintenance-mode strategy is propagated to the VirtualMachineInstance and the corresponding Pod, when the VirtualMachine is created

Setup

Create a new VirtualMachine and specify a valid maintenance-mode strategy:

type: kubevirt.io.virtualmachine
metadata:
  namespace: default
  labels:
    harvesterhci.io/maintain-mode-strategy: Shutdown
    [...]
  name: test
[...]

Expected Result

The VirtualMachine resource, the VirtualMachineInstance resource and the belonging Pod resource all have the specified maintenance-mode strategy label.

$ kubectl get vm,vmi,pod -l harvesterhci.io/maintain-mode-strategy=Shutdown
NAME                              AGE   STATUS    READY
virtualmachine.kubevirt.io/test   10m   Running   True

NAME                                      AGE   PHASE     IP           NODENAME           READY
virtualmachineinstance.kubevirt.io/test   10m   Running   10.52.1.65   harvester-node-1   True

NAME                           READY   STATUS    RESTARTS   AGE
pod/virt-launcher-test-n5tf2   2/2     Running   0          10m

Test 2

The maintenance-mode strategy is updated on the VirtualMachineInstance and on the corresponding Pod, when the maintenance-mode strategy is updated on the VirtualMachine resource.

//TODO

Scenario 4

In case a VirtualMachine resource, VirtualMachineInstance resource, or a corresponding Pod exists and either has no maintenance-mode strategy label, or the maintenance-mode strategy label has an invalid value, the default behavior (Migrate) should be applied.

// TODO

The Pod deletion filter identifies the Pods belonging to VMs, which need
to be deleted during a node drain by looking at their maintenance-mode
strategy label. The default value for this label is `Migrate`, which
indicates that the Pod should be deleted during the node drain.
Pods with an invalid label, i.e. where the value of the maintenance-mode
strategy label is not one of:
  - Migrate
  - ShutdownAndRestartAfterEnable
  - ShutdownAndRestartAfterDisable
  - Shutdown
are now also being treated with the default behavior.
This fixes the problem that if a VM contains an invalid value in this
label, nodes can become stuck in `Cordoned` state when transitioning to
maintenance mode, as the node drain controller won't shut down the VM,
but it's also not migrated away from the node. Therefore the VM keeps
running, preventing the node from completely transitioning into
maintenance mode.

related-to: harvester#6835

Signed-off-by: Moritz Röhrich <[email protected]>
Add checks for maintenance-mode strategy to VM webhooks.

The admission webhooks for the VirtualMachine resource need to make
sure that the maintenance-mode strategy for the VM is set to a sane
value.
To do this, there are two checks:

The first one is in the mutating webhook, which is executed first, and
it just makes sure that the label, which defines the maintenance-mode
strategy, is set. If the label is not set, the mutating webhook will set
it with the default value of `Migrate`. If the label is set, the
mutating webhook will not modify it, even if it has an invalid value.
This ensures that the maintenance-mode strategy is never set
unintentionally to a wrong value.
The second check will ensure that in this case the request is rejected
with an error message, so the user can correct the value of the
maintenance-mode strategy label.
The mutating webhook will also ensure that the maintenance-mode strategy
label is copied from the `.metadata.labels` to
`.spec.template.metadata.labels`. This is necessary to ensure that the
Pod in which the virtual machine will run will be labeled correctly.

The second check happens in the validating webhook. This check ensures
that the maintenance-mode strategy label is set to a valid
maintenance-mode strategy, i.e. one of the values:
  - Migrate
  - ShutdownAndRestartAfterEnable
  - ShutdownAndRestartAfterDisable
  - Shutdown
This webhook will deny a CREATE or UPDATE, if the new VirtualMachine
resource does not contain the maintenance-mode strategy label at all, or
if it contains an invalid value.
The only exception is an UPDATE to a VirtualMachine resource that
already contains an invalid value in the maintenance-mode strategy label
and where the value does not change. In this case, the webhook will
accept the request. This is crucial, in case the controller needs to
deal with an existing VM that has an invalid value in this label (e.g.
on a cluster that has been upgraded from an old version, before this
label was checked by the admission webhook). In this case, the
controller still needs to be able to perform UPDATE operations on the
resource, to operate the VM.

Together, these two checks ensure that no VirtualMachine resource can be
created with an invalid maintenance-mode strategy, or with no
maintenance-mode strategy at all.
They also make sure that the maintenance-mode strategy can not be
removed or changed to an invalid value for existing VirtualMachine
resources.

related-to: harvester#6835

Signed-off-by: Moritz Röhrich <[email protected]>
@m-ildefons m-ildefons added area/admission-webhook Kubernetes validation and mutating admission webhooks area/node-maintenance labels Dec 13, 2024
@m-ildefons m-ildefons added this to the v1.5.0 milestone Dec 13, 2024
@m-ildefons m-ildefons requested review from bk201 and albinsun December 13, 2024 16:18
@m-ildefons m-ildefons self-assigned this Dec 13, 2024
@bk201 bk201 requested review from ibrokethecloud and w13915984028 and removed request for albinsun December 19, 2024 07:57
@w13915984028 (Member) commented:
A question about the scenario:

Some VMs can't be migrated due to selectors or dedicated devices. And sometimes a user doesn't want a VM to be migrated accidentally.

The original design respected this: it only processed VMs with a valid strategy set and skipped any unset or invalid values.

@m-ildefons m-ildefons requested a review from bk201 December 19, 2024 14:32
@m-ildefons (Contributor, Author) replied:

> A question about the scenario:
>
> Some VMs can't be migrated due to selectors or dedicated devices. And sometimes a user doesn't want a VM to be migrated accidentally.
>
> The original design respected this: it only processed VMs with a valid strategy set and skipped any unset or invalid values.

In that case, the user should set one of the other maintenance-mode strategies. It is documented that VMs without a maintenance-mode strategy are migrated as well: https://docs.harvesterhci.io/v1.4/host/#node-maintenance
With the maintenance-mode label ShutdownAndRestartAfterDisable, they can even have the VMs be started again automatically, once the node is no longer in maintenance mode.

Users can also manually shut down VMs prior to enabling the maintenance mode, if they want to prevent migration.

I've reworked the label synchronization between VMI resources and Pods, so the maintenance-mode strategy can now be updated without restarting the VM.

Add a new VMI controller, which syncs all labels with the
`harvesterhci.io` prefix to the Pod(s) corresponding to the VMI.
This new controller ensures that Harvester's label namespace is
propagated all the way down to the Pod of a VirtualMachine.
Changes to the labels of a VirtualMachine or VirtualMachineInstance do
not justify re-creating the Pod, since labels are metadata. But labels
like the maintenance-mode strategy label have influence on the behavior
of VMs, and they may also be needed for features like load balancers to
work properly.
This controller synchronizes the labels from the VirtualMachineInstance
to the Pod to make sure that any change on the upstream objects is
reflected on the Pod.
For the maintenance-mode strategy label specifically, this means that
the maintenance-mode strategy of a VM can be updated without needing to
restart the VM.

Signed-off-by: Moritz Röhrich <[email protected]>
@w13915984028 (Member) left a comment:

Please check those questions, thanks.

pkg/webhook/resources/virtualmachine/validator.go (comment outdated, resolved)
// copy labels from VMI to Pod
for label := range harvesterVMILabels {
	pod.Labels[label] = harvesterVMILabels[label]
}

Review comment (Member): the following is a bit faster, as it avoids the second map lookup:

for label, value := range harvesterVMILabels {
	pod.Labels[label] = value
}

_, err := h.podClient.Update(pod)

Review comment (Member): vmi onChange happens many times; use DeepEqual to check whether the Pod has really changed before calling Update. DeepEqual runs in the local Go thread, while .Update goes to the apiserver, which is much slower.

Reply (@m-ildefons, Contributor Author):

I've made several changes that should improve the performance of this entire method.
Avoiding the Update as much as possible is one of them, but the DeepCopy now also only happens when the Update happens.
Additionally, I've replaced a few of the for-loops with functions from the standard library and removed the dynamically growing array.
The last big improvement in this department is that I'm now using a different Pod indexer, which only returns Pods associated with the VirtualMachine rather than all Pods on the node where the VirtualMachineInstance is running, so there are far fewer Pods to look through.

patchOps = append(patchOps, string(patch[:]))
}

// If the value under .spec.template.metadata.labels does not match the
Review comment (Member):

We need to differentiate the create and update cases:

For create, a wrong strategy should be rejected with a clear error, so the user notices and adjusts it.

For update, the value may come from a legacy VM; adjusting the wrong value is more user-friendly.

Reply (@m-ildefons, Contributor Author):

The mutator will not reject anything, since it's not its job.

The mutator will just make sure that the label is set (or it will set it itself to the default) and that the label in the .spec.template matches the label in .metadata.

If the label is set but to an invalid value, the mutator will not change it, because that would be unexpected for the user.
However, the validator will later reject the operation if the label is invalid.

@@ -79,6 +79,11 @@ func (m *vmMutator) Create(_ *types.Request, newObj runtime.Object) (types.Patch
return nil, err
}

patchOps, err = m.ensureMaintenanceModeStrategyIsSet(vm, patchOps)
Review comment (Member):

As in the comment below: in the create case, do not change the user's input silently.

Reply (@m-ildefons, Contributor Author):

In the create case, the user's input is never changed. The mutator only adds the label if it doesn't exist.

// SyncHarvesterVMILabelsToPod ensures that all Harvester labels (i.e. those
// with the harvesterhci.io/ prefix) from the VirtualMachineInstance are synced
// to the Pod
func (h *VMIController) SyncHarvesterVMILabelsToPod(_ string, vmi *kubevirtv1.VirtualMachineInstance) (*kubevirtv1.VirtualMachineInstance, error) {
Review comment (Member):

This function may have a performance issue.

In which case is the strategy not synced to the Pod correctly? Could we handle that specific case instead of running this function on every VMI change? The VMI changes many times when a new VM is starting or migrating.

Reply (@m-ildefons, Contributor Author):

I haven't seen performance issues in my testing.
Regardless, I've applied the suggestions from the first round of review and made some more improvements to this method, which all should help with performance or at least robustness.

@w13915984028 (Member) commented:
By the way, from Dec 20 I will be offline and not reviewing PRs.

Please move the PR forward if it meets approval; do not wait for me, thanks.


logrus.Debugf("Syncing labels %v for VMI %v to Pod", vmi.Labels, vmi.Name)

harvesterVMILabels := map[string]string{}
Review comment (Contributor):

We should skip the VMI if the maintenance strategy is Migrate. This allows KubeVirt to handle the migration of VMs by watching for nodes being placed into maintenance mode via its own watchers.
We do not need to do anything specific for virt-launcher Pods in this case.

The additional benefit is that we will not be patching the Pod labels for every launcher Pod. It is very likely that the maintenance strategy is only used by a subset of VMs among a large number of end-user clusters. This greatly reduces the number of reconciles and update requests to the apiserver for the Pods.

Reply (@m-ildefons, Contributor Author):

This controller is not just for the maintenance-mode functionality. It syncs all labels with the harvesterhci.io prefix, which may come in handy later on, e.g. when we need to match Pods to their Rancher downstream clusters, or when we need to label Pods of node-pools, so load-balancers can identify them.

I'm also not a huge fan of implementing this functionality in a controller, but I don't see many other options.

@@ -296,3 +307,41 @@ func (v *vmValidator) checkReservedMemoryAnnotation(vm *kubevirtv1.VirtualMachin
func (v *vmValidator) checkStorageResourceQuota(vm *kubevirtv1.VirtualMachine, oldVM *kubevirtv1.VirtualMachine) error {
return v.rqCalculator.CheckStorageResourceQuota(vm, oldVM)
}

func checkMaintenanceModeStrategyIsValid(newVM, oldVM *kubevirtv1.VirtualMachine) error {
Review comment (Contributor):

We should fine-tune this more, to only ignore updates where the migration strategy is unchanged between oldVM and newVM, even when the value is not a valid strategy.

An old VM from, say, 1.4.0, which did not get the strategy applied as a default, will not have a strategy at all. Any update on such a VM will be rejected by the validator, which may have unexpected results.

For example, I deployed the harvester / harvester-webhook to an existing 1.4.0 cluster to mimic the upgrade, and post-upgrade I still have VMs without the migration strategy set up.

As expected, hot-plug volume attachments to this VM are broken because the webhook is blocking the request:

node-1-dhcp:~ # virtctl addvolume test --volume-name=attachment --persist
You are using a client virtctl version that is different from the KubeVirt version running in the cluster
Client Version: v1.1.0
Server Version: v1.2.2
error adding volume, admission webhook "validator.harvesterhci.io" denied the request: label `harvesterhci.io/maintain-mode-strategy` not found

Reply (@m-ildefons, Contributor Author):

Weird... it seems the mutating webhook didn't run, otherwise it would have added the label. But the validating webhook did run.
I'm not sure whether virtctl addvolume is an update operation or a patch.

Simplify control flow for the maintenance-mode strategy label validation
in the VirtualMachine validating admission webhook.

Signed-off-by: Moritz Röhrich <[email protected]>
@m-ildefons m-ildefons force-pushed the fix-6835 branch 2 times, most recently from cc1011b to e1d1d55 Compare December 20, 2024 15:54
Improve SyncHarvesterVMILabelsToPod control loop:

- Make use of the `maps` module from the standard library for cloning,
  comparing, copying. Also use `slices` module from the standard library
  for similar operations on arrays.
- Slight performance improvement in for loops over maps by avoiding
  double lookups of values
- Use a different Pod indexer to reduce the number of Pods that need to be
  searched to find the Pods belonging to the VMI
- Check that labels really need updating before committing to expensive
  operations like DeepCopy and Update on the Pods
- Remove ad-hoc allocation and dynamic growing of array

Signed-off-by: Moritz Röhrich <[email protected]>