After a reboot, the agent node cannot rejoin rke2 cluster unless previous boot's agent secret is removed #7154

Closed
robotarmy opened this issue Oct 29, 2024 · 3 comments

robotarmy commented Oct 29, 2024

Environmental Info:
RKE2 Version:
rke2 version v1.30.5+rke2r1 (0c83bc8)
go version go1.22.6 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Fedora CoreOS 40.20241006.3.0
Linux farmbot93.yyy.zzz 6.10.12-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Sep 30 21:38:25 UTC 2024 x86_64 GNU/Linux

Cluster Configuration:
3 servers, 4 agents

Describe the bug:
After an agent node restarts, the agent cannot rejoin the cluster using the secret stored at /etc/rancher/node/password.

Steps To Reproduce:
I have a systemd unit for the agent which runs the following script:

            /bin/echo ">>>> Fetching Cluster Configuration <<<<"
            # INSTALL CONFIG
            /bin/curl -k https://config.private/agent_config.yaml -o /etc/rancher/rke2/config.yaml
            /bin/chmod 600 /etc/rancher/rke2/config.yaml
            /bin/echo ">>>> Found farmbot - setting rke2-agent"
            /bin/systemctl enable rke2-agent.service
            # use --no-block so this service can exit cleanly and let rke2-agent mind its own business
            /bin/systemctl start --no-block --now rke2-agent.service

The intention of the script is to start rke2-agent and to ensure that the current agent configuration is on the host. I do this because Fedora CoreOS does not retain the effect of /bin/systemctl enable rke2-agent.service between reboots. The CoreOS Ignition file injects this script to run on boot.

Before a reboot the password is, for example:
6ea01d43a4b573d91e182de8f10bce2f

After the reboot the password becomes:
a93405ee4af52472a547eb95e1a62301

When the old node-password secret is removed from the cluster, the new secret can be added and the node joins the cluster.

  • Installed RKE2:

Installed RKE2 via tarball, using a one-off systemd unit which downloads the runtime and then runs the tarball-based install.

Expected behavior:

I expected the node password not to be reset by my script on boot.
I expected that the config agent-token would generate some password hash deterministically across reboots.

I expect that neither
/bin/systemctl start --no-block --now rke2-agent.service nor /bin/systemctl enable rke2-agent.service
would generate a new password.

Actual behavior:

Each reboot generates a new node password, and the new password cannot be used to rejoin. When the node-password secret for the node is removed, the node rejoins.

Additional context / logs:

The log complains about a pre-existing node with the same name in the cluster.

** Please indicate if I'm misunderstanding something or using this incorrectly **

Thank you.

@brandond
Member

brandond commented Oct 29, 2024

> I expected that the config agent-token would generate some password hash deterministically across reboots.

No, a deterministically generated "hash" would not be a very good password. The behavior you are observing here is specifically covered in the documentation: https://docs.rke2.io/advanced

How Agent Node Registration Works

Agents register with the server using the cluster secret portion of the join token, along with a randomly generated node-specific password, which is stored on the agent at /etc/rancher/node/password. The server will store the passwords for individual nodes as Kubernetes secrets, and any subsequent attempts must use the same password. Node password secrets are stored in the kube-system namespace with names using the template <host>.node-password.rke2. These secrets are deleted when the corresponding Kubernetes node is deleted.

Note: Prior to RKE2 v1.20.2 servers stored passwords on disk at /var/lib/rancher/rke2/server/cred/node-passwd.

If the /etc/rancher/node directory of an agent is removed, the password file should be recreated for the agent prior to startup, or the entry removed from the server or Kubernetes cluster (depending on the RKE2 version).
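The "entry removed from the ... Kubernetes cluster" step in the passage above can be sketched in shell as follows. This is a minimal sketch, not an official RKE2 tool: the function names are hypothetical, it assumes the node name equals the hostname used in the `<host>.node-password.rke2` template, and it assumes `kubectl` is installed and configured with rights to delete secrets in `kube-system`.

```shell
#!/bin/sh
# Build the secret name from the documented template: <host>.node-password.rke2
# (hypothetical helper; the argument is assumed to be the Kubernetes node name)
node_password_secret_name() {
    printf '%s.node-password.rke2\n' "$1"
}

# Delete the stale node-password secret so the node can register again
# with whatever password it currently holds.
# Assumes kubectl is on PATH and already pointed at the cluster.
delete_node_password_secret() {
    kubectl --namespace kube-system delete secret "$(node_password_secret_name "$1")"
}
```

For the host in this report, `delete_node_password_secret farmbot93.yyy.zzz` would target the secret `farmbot93.yyy.zzz.node-password.rke2`, provided the node registered under that name.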

To resolve this, you will need to do one of the following:

  • persist /etc/rancher/node/password across reboots
  • deterministically generate the contents of this file prior to starting RKE2
  • ensure that the node password secrets are removed whenever you reboot your nodes
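The second option (deterministically generating the file) could look something like the sketch below. It is not how RKE2 itself creates the password. The derivation scheme (SHA-256 over a secret seed plus the node name), the function names, and the 32-hex-character format (taken from the example passwords in the report) are all assumptions. Because a deterministic value is only as secret as its inputs, the seed has to be kept private.

```shell
#!/bin/sh
# Derive a stable 32-character hex string from a secret seed and a node name.
# Anyone who knows the seed can recompute the password, so keep the seed private.
derive_node_password() {
    seed="$1"
    node="$2"
    printf '%s:%s' "$seed" "$node" | sha256sum | cut -c1-32
}

# Write the password to the given path with owner-only permissions.
# The path is a parameter so the function can be tried on a scratch file first.
write_node_password() {
    password="$1"
    dest="$2"
    mkdir -p "$(dirname "$dest")"
    ( umask 077 && printf '%s\n' "$password" > "$dest" )
    chmod 600 "$dest"
}
```

A boot script might call `write_node_password "$(derive_node_password "$seed" "$(uname -n)")" /etc/rancher/node/password` before starting rke2-agent, where `$seed` is read from wherever the deployment keeps its secrets.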

@robotarmy
Author

@brandond, my understanding is that the /etc/rancher/ directory is maintained across reboots.

/etc and /var are allowed to hold state, IIRC, which is part of my confusion. I could be wrong; I can update my script to report whether it finds a node password before starting the agent.
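A check like the one described here might be sketched as below. The function name is hypothetical, and printing a truncated SHA-256 fingerprint instead of the password itself is a design choice of this sketch so that boot logs can be compared without leaking the secret.

```shell
#!/bin/sh
# Report whether a node password file exists, with a short fingerprint of its
# contents so two boots can be compared without logging the password itself.
report_node_password() {
    file="${1:-/etc/rancher/node/password}"
    if [ -s "$file" ]; then
        fingerprint="$(sha256sum < "$file" | cut -c1-12)"
        echo "node password present at $file (sha256 prefix $fingerprint)"
    else
        echo "node password missing or empty at $file"
    fi
}
```

Calling `report_node_password` just before the `systemctl start` line would show whether the file survived the reboot and whether its contents changed between boots.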

In any case, it seems I can work around this by pre-generating the contents of this file in a uniform way.

I'm having a bit of trouble understanding how this would impact token rotation. Thank you for your input and for answering my bug.

@robotarmy
Author

Ah, I see: my understanding of /etc was actually incorrect. I need to take ostree semantics into account.

Thank you.
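For the first option in the list earlier in the thread (persisting /etc/rancher/node/password across reboots), one possible shape is sketched below. It assumes some directory on the host does survive reboots. Which one that is depends on the host's ostree and provisioning setup, so the backup path is a placeholder and the function name is hypothetical.

```shell
#!/bin/sh
# Keep a copy of the node password in a location assumed to survive reboots,
# and restore it before the agent starts if the live file has gone missing.
sync_node_password() {
    live="$1"      # e.g. /etc/rancher/node/password
    backup="$2"    # placeholder: a path known to persist on this host
    if [ -s "$live" ]; then
        # Live file exists: refresh the backup copy.
        mkdir -p "$(dirname "$backup")"
        ( umask 077 && cp "$live" "$backup" )
    elif [ -s "$backup" ]; then
        # Live file is gone but a backup exists: put it back.
        mkdir -p "$(dirname "$live")"
        ( umask 077 && cp "$backup" "$live" )
    fi
}
```

The function would need to run before rke2-agent starts, since the agent is expected to create a fresh password when the file is absent. If neither file exists, it does nothing and the agent generates a password as usual.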
