
[BUG] machine_labels are 'sticky' to the position of the machine_pool in the list of machine_pools #1254

Open
hansbogert opened this issue Oct 19, 2023 · 6 comments

hansbogert commented Oct 19, 2023

Rancher Server Setup

  • Rancher version: 2.7.6
  • Installation option (Docker install/Helm Chart):
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): rke2

Information about the Cluster

  • Kubernetes version: 1.26
  • Cluster Type (Local/Downstream): local
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):

Provider Information

  • What is the version of the Rancher v2 Terraform Provider in use? 3.1.x
  • What is the version of Terraform in use? 1.5.2

Describe the bug

Having machine labels in machine_pools causes unnecessary recreation of machine pools when other machine_pools are deleted. Worse, the recreated machine_pool machines can end up with the wrong machine_labels on them.

The situation in which this actually occurs is explained below.

To Reproduce

Have a cluster with the following machine_pools (in pseudo-config)

machine_pools:
  - pool1
  - pool2
    machine_labels:
      foo: bar
  - pool3

Remove pool1 using Terraform

Actual Result

In pseudo-config:

machine_pools:
  - pool2
    machine_labels:
      foo: bar
  - pool3
    machine_labels:
      foo: bar

Expected Result

In pseudo-config:

machine_pools:
  - pool2
    machine_labels:
      foo: bar
  - pool3

Additional context

There are multiple issues at play here:

  • machine_pools is a list. A set or map would've been a better choice.
  • The machine_labels attribute is optional but also computed in the Terraform schema. Why is it computed? Optional-plus-computed attributes are an edge case in the Terraform SDK, similar to this one

If machine_labels does not need to be computed, then this issue is easily solved by removing the Computed attribute from the schema. I've verified that the behavior is then correct.
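For illustration, a minimal sketch of what the relevant schema entry presumably looks like, assuming the provider uses the Terraform plugin SDK's helper/schema package (layout assumed for illustration, not copied from the provider source):

package provider

import "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"

// machinePoolLabelsSchema sketches the machine_labels attribute as
// described above. Because the attribute is both Optional and Computed,
// a value missing from the refreshed state means "keep the old value",
// which is what lets labels stick to a list position.
func machinePoolLabelsSchema() *schema.Schema {
	return &schema.Schema{
		Type:     schema.TypeMap,
		Optional: true,
		Computed: true, // removing this line is the fix verified above
		Elem:     &schema.Schema{Type: schema.TypeString},
	}
}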

@hansbogert changed the title to [BUG] machine_labels are 'sticky' to the position of the machine_pool in the list of machine_pools on Oct 19, 2023
@a-blender a-blender self-assigned this Nov 25, 2023
@kkaempf kkaempf removed the team/area2 label Dec 7, 2023
hansbogert (Author) commented

@a-blender @kkaempf A gentle nudge on this issue: is everything clear, or are more info or reproduction details needed?

@kkaempf kkaempf assigned snasovich and unassigned a-blender Apr 11, 2024
kkaempf (Contributor) commented Apr 11, 2024

Reassigning for better visibility.

hansbogert (Author) commented

Is something blocking this? @snasovich @Jono-SUSE-Rancher

hansbogert (Author) commented Nov 19, 2024

@snasovich We hit this in production again; it's such a subtle thing to get wrong, and it caused redeploys of machines with the wrong labels. Is there something we can do to help get this fixed?

pvlkov commented Nov 28, 2024

Hello @hansbogert,

We have been monitoring this issue for quite some time, because we are also facing it in our Rancher-provisioned on-premises clusters.

We have looked through the code to understand where this problem comes from, and we may have found a workaround. We have been using it for quite some time now and it seems to have gotten rid of the "inherited" node-pool labels (and also taints, which have the same problem) for us.

In our opinion this problem arises from two length checks in the flattenClusterV2RKEConfigMachinePools function, one on the pool's labels map and one on its taints slice:

func flattenClusterV2RKEConfigMachinePools(p []provisionv1.RKEMachinePool) []interface{} {

When the labels map and the taints slice are empty for the new pool, the list returned from this function does not include the corresponding "labels" or "taints" key for the newly created pool, so Terraform does not register a diff for these fields. The old taints and labels of the pool that previously occupied this position in the full list of machine_pools are therefore never removed.

Our workaround consists of removing the two length checks mentioned above, as sketched below. This is unproblematic: the flattenTaintsV2 function that is called already includes its own len > 0 check, and toMapInterface simply returns an empty map for empty input, which is also fine and does not break anything.
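To make the mechanism concrete, here is a self-contained sketch of the pattern (the pool type and helper names are simplified stand-ins for illustration, not the provider's actual source):

package main

import "fmt"

type pool struct {
	Labels map[string]string
}

// flattenWithGuard mirrors the problematic code: the "labels" key is
// only emitted when the map is non-empty, so Terraform never registers
// a diff for a pool whose labels are empty and keeps the stale value.
func flattenWithGuard(p pool) map[string]interface{} {
	obj := map[string]interface{}{}
	if len(p.Labels) > 0 {
		obj["labels"] = p.Labels
	}
	return obj
}

// flattenAlways mirrors the workaround: the key is always present, so
// an empty map is diffed against the stale value and clears it.
func flattenAlways(p pool) map[string]interface{} {
	return map[string]interface{}{"labels": p.Labels}
}

func main() {
	empty := pool{}
	fmt.Println(flattenWithGuard(empty)) // map[] -- no "labels" key, no diff registered
	fmt.Println(flattenAlways(empty))    // map[labels:map[]] -- empty value overwrites stale labels
}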

The only downside of removing these checks is that Terraform will now report a diff when you remove all taints from an existing pool, which suggests that removing taints from existing pools is possible, while in reality adding or removing taints only takes effect at node registration and has no effect on existing nodes of a pool.

Using this workaround we are able to circumvent the issue by maintaining our own build of the Terraform provider binary and using it with Terraform's developer overrides functionality. This is of course a little cumbersome, and an upstream fix would be preferable, but due to the taints problem mentioned above we have not suggested this to the maintainers (yet).

Hope this helps you with your issue as well, and maybe the maintainers can weigh in on whether this could be properly incorporated into the provider at some point.

pvlkov commented Nov 28, 2024

By the way, the same issue (with a different root cause) also exists for the machineDeploymentAnnotations used by the cluster-autoscaler. These are always of the form "cluster.provisioning.cattle.io/...", for example the max pool-size annotation "cluster.provisioning.cattle.io/autoscaler-max-size". They are also inherited by newly created pools, because this check

DiffSuppressFunc: func(k, old, new string, d *schema.ResourceData) bool {

suppresses the diff for these annotations.
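For illustration, a self-contained sketch of that suppression behavior (the attribute-key format in the example is an assumption, not taken from the provider):

package main

import (
	"fmt"
	"strings"
)

// suppressCattleAnnotations mimics the effect described above: any
// planned change to an annotation key under the
// "cluster.provisioning.cattle.io/" prefix is hidden from the diff,
// which is why these annotations stick to a list position.
func suppressCattleAnnotations(key, old, new string) bool {
	return strings.Contains(key, "cluster.provisioning.cattle.io/")
}

func main() {
	// Hypothetical attribute path for a pool annotation.
	k := "machine_pools.1.annotations.cluster.provisioning.cattle.io/autoscaler-max-size"
	fmt.Println(suppressCattleAnnotations(k, "3", "")) // true: the removal is suppressed
}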

But this is probably a topic for a separate issue.
