
rke2-server creates improper etcd member name #5482

Closed
airbjorn opened this issue Feb 16, 2024 · 13 comments

airbjorn commented Feb 16, 2024

Environmental Info:
RKE2 Version:
rke2 version v1.26.13+rke2r1
go version go1.20.13 X:boringcrypto

Node(s) CPU architecture, OS, and Version:
Linux kma004.hiddendomain.tld 5.14.0-362.18.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Jan 3 15:54:45 EST 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

  • Rancher 2.7.10 running on a k3s cluster
  • dedicated VMs: 3 Master Nodes (controlplane+etcd), 2 Worker Nodes

Describe the bug:
After creating the new rke2 cluster, the etcd member names consist only of the suffix beginning with -

Steps To Reproduce:

  1. create new rke2 Custom Cluster with 3 Master and 2 Worker Nodes
  2. Cloud Provider: Default - RKE2 Embedded
  3. Container Network: cilium
  4. Selection of CIS Profile doesn't seem to make any difference
  5. PSA Configuration Template: rancher-privileged

Expected behavior:

  • The Master Nodes enter the Condition "EtcdIsVoter"
  • the etcd member names follow the scheme "<hostname>-<suffix>"

Actual behavior:

  • All Master Nodes get "Ready"
  • however they do not reach the Condition "EtcdIsVoter"
  • the etcd member names shown in the list only follow the unusual scheme "-<suffix>"

Additional context / logs:

  • DNS resolving is properly configured
  • when updating an existing rke2 from 1.24 to 1.26, the "EtcdIsVoter" condition appears properly
  • "EtcdIsVoter" gets lost only on the new Node when a Master Node is replaced by a new one

---- rke2-server Log start ----
Please find rke2-server Logfile in comment below.
---- rke2-server Log end ----

[screenshot: etcd-members-without-nodename]


airbjorn commented Feb 16, 2024

rke2-server.log

brandond (Member) commented:

Please don't paste giant log chunks inline. If you're going to send more than a small handful of lines, attach a file to your comment.

@airbjorn airbjorn changed the title rke2-server does not create etcd member name with hostname-suffix any more rke2-server creates improper etcd member name Feb 16, 2024

brandond commented Feb 16, 2024

Can you provide the output of kubectl get node -l node-role.kubernetes.io/etcd=true -o yaml? Please attach this instead of pasting directly into a comment.

Can you also show the output of cat /var/lib/rancher/rke2/server/db/etcd/name on the etcd nodes?

If you have any clusters that have not yet been upgraded, I would be curious to see the same info on similarly configured but not yet upgraded nodes. Do your nodes have the correct name within the etcd cluster prior to upgrading?


airbjorn commented Feb 16, 2024

The attached file was generated on the rke2 cluster, which has been created with v1.26 directly:
master-nodes.yaml.txt

airbjorn (Author) commented:

...and this rke2 cluster was originally created with v1.24 and has meanwhile been upgraded to v1.26. No nodes have been replaced yet:
master-nodes-upgraded-rke2.yaml.txt

brandond (Member) commented:

OK, so just to be clear: it does not rename the existing nodes, but when you join new nodes to the cluster, they have the wrong name?

Are you upgrading your clusters by adding new nodes on 1.26, waiting for them to finish joining, and then deleting the 1.24 nodes?

Note that we don't support skipping minor versions when upgrading, you should be going 1.24 -> 1.25 -> 1.26. Reference:
https://kubernetes.io/releases/version-skew-policy/#supported-version-skew

airbjorn (Author) commented:

> OK, so just to be clear: it does not rename the existing nodes, but when you join new nodes to the cluster, they have the wrong name?

Yes. In detail:

  • When I join new nodes to the upgraded cluster, they get the wrong name.
  • When I create a new cluster from scratch with 1.26, they all get the wrong name from the beginning.
  • Both the log file and the screenshot refer to the newly created rke2 cluster with 1.26

> Are you upgrading your clusters by adding new nodes on 1.26, waiting for them to finish joining, and then deleting the 1.24 nodes?

The update strategy is:

  • first update to the new rke2 release
  • join new nodes after rke2 updates have been completed

> Note that we don't support skipping minor versions when upgrading, you should be going 1.24 -> 1.25 -> 1.26. Reference: https://kubernetes.io/releases/version-skew-policy/#supported-version-skew

Thanks for this hint! Indeed with the rke2 upgrade I skipped 1.25.

brandond (Member) commented:

In addition to the etcd/name file I requested up above, can you also grab the rancher-system-agent logs from journald on the same node you provided the rke2-server log from? ds-cen-kma004 I think?

airbjorn (Author) commented:

Ah sorry, forgot that one. Here comes the requested info from ds-cen-kma004:

ds-cen-kma004:~$ sudo cat /var/lib/rancher/rke2/server/db/etcd/name
-c13fbae3
--> without a newline at the end.

rancher-system-agent-kma004.log


brandond commented Feb 16, 2024

It looks like for some reason Rancher is running a snapshot list command at the same time it installs and starts rke2. The snapshot list command races with the main rke2 server process to create the name file; if it wins, it sets an empty hostname, which the server then uses once it finishes starting up.
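The failure mode described above can be sketched with a toy shell reproduction. This is assumed mechanics, not RKE2's actual code: whichever process creates the name file first wins, and a later writer with the correct hostname simply keeps the existing file.

```shell
# Toy reproduction of the race (assumed mechanics, not RKE2's actual code).
name_file="${TMPDIR:-/tmp}/etcd-name-race-demo"
rm -f "$name_file"

write_name_if_absent() {  # $1 = hostname part, $2 = random suffix
  # Create-if-absent with no atomicity: first caller wins permanently.
  if [ ! -f "$name_file" ]; then
    printf '%s-%s' "$1" "$2" > "$name_file"
  fi
}

write_name_if_absent ""       "c13fbae3"  # snapshot-list path: hostname still empty
write_name_if_absent "kma004" "c13fbae3"  # server path: loses the race, keeps the file
cat "$name_file"  # prints "-c13fbae3", matching the member names in this issue
```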

Feb 16 14:32:20 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:20+01:00" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is starting"
Feb 16 14:32:20 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:20+01:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Feb 16 14:32:20 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:20+01:00" level=info msg="Starting remote watch of plans"
Feb 16 14:32:20 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: E0216 14:32:20.781231    3102 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
Feb 16 14:32:20 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:20+01:00" level=info msg="Starting /v1, Kind=Secret controller"
Feb 16 14:32:20 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:20+01:00" level=info msg="Detected first start, force-applying one-time instruction set"
Feb 16 14:32:20 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:20+01:00" level=info msg="[Applyinator] Applying one-time instructions for plan with checksum bf41949e2ec73ed144e2a2c31619f8e9e09fc8c6fdf3967a3add3b42932f15ab"
Feb 16 14:32:20 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:20+01:00" level=info msg="[Applyinator] Extracting image docker-all.artifactory.hiddendomain.tld:443/rancher/system-agent-installer-rke2:v1.26.13-rke2r1 to directory /var/lib/rancher/agent/work/20240216-143220/bf41949e2ec73ed144e2a2c31619f8e9e09fc8c6fdf3967a3add3b42932f15ab_0"
Feb 16 14:32:20 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:20+01:00" level=info msg="Using private registry config file at /etc/rancher/rke2/registries.yaml"
Feb 16 14:32:20 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:20+01:00" level=info msg="Pulling image docker-all.artifactory.hiddendomain.tld:443/rancher/system-agent-installer-rke2:v1.26.13-rke2r1"

Feb 16 14:32:22 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:22+01:00" level=info msg="Extracting file installer.sh to /var/lib/rancher/agent/work/20240216-143220/bf41949e2ec73ed144e2a2c31619f8e9e09fc8c6fdf3967a3add3b42932f15ab_0/installer.sh"
Feb 16 14:32:22 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:22+01:00" level=info msg="Extracting file rke2.linux-amd64.tar.gz to /var/lib/rancher/agent/work/20240216-143220/bf41949e2ec73ed144e2a2c31619f8e9e09fc8c6fdf3967a3add3b42932f15ab_0/rke2.linux-amd64.tar.gz"
Feb 16 14:32:22 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:22+01:00" level=info msg="Extracting file sha256sum-amd64.txt to /var/lib/rancher/agent/work/20240216-143220/bf41949e2ec73ed144e2a2c31619f8e9e09fc8c6fdf3967a3add3b42932f15ab_0/sha256sum-amd64.txt"
Feb 16 14:32:22 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:22+01:00" level=info msg="Extracting file run.sh to /var/lib/rancher/agent/work/20240216-143220/bf41949e2ec73ed144e2a2c31619f8e9e09fc8c6fdf3967a3add3b42932f15ab_0/run.sh"
Feb 16 14:32:22 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:22+01:00" level=info msg="[Applyinator] Running command: sh [-c run.sh]"

Feb 16 14:32:24 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:24+01:00" level=info msg="[Applyinator] Command sh [-c run.sh] finished with err: <nil> and exit code: 0"

Feb 16 14:32:24 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:24+01:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240216-143220/bf41949e2ec73ed144e2a2c31619f8e9e09fc8c6fdf3967a3add3b42932f15ab_0"
Feb 16 14:32:24 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:24+01:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
Feb 16 14:32:24 ds-cen-kma004.hiddendomain.tld rancher-system-agent[3102]: time="2024-02-16T14:32:24+01:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 1"

I'm not sure why Rancher is trying to list snapshots before rke2 is even installed and started; that doesn't seem right. But we should fix the issue that is causing the empty name file to be created and used by RKE2.


brandond commented Feb 16, 2024

Thanks for the report!

This should be fixed for the March releases. It won't change the name on existing nodes, but it will set the node name properly on new nodes, and handle the weird name on nodes that are missing the hostname.
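A defensive check along these lines (a hypothetical sketch, not the actual RKE2 patch) would regenerate any persisted name whose hostname component is empty instead of reusing it:

```shell
# Hypothetical guard, not the actual RKE2 fix: treat a persisted name that
# starts with "-" (empty hostname component) as invalid and regenerate it.
name_file="${TMPDIR:-/tmp}/etcd-name-guard-demo"
printf '%s' "-c13fbae3" > "$name_file"  # simulate the bad file from this issue

host="$(hostname -s)"
suffix="c13fbae3"  # in reality a random per-node suffix

case "$(cat "$name_file" 2>/dev/null)" in
  ""|-*)  # missing, empty, or missing the hostname prefix: regenerate
    member_name="${host}-${suffix}"
    printf '%s' "$member_name" > "$name_file"
    ;;
  *)      # looks valid: keep the existing name
    member_name="$(cat "$name_file")"
    ;;
esac
echo "$member_name"  # now "<hostname>-c13fbae3" instead of "-c13fbae3"
```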

VestigeJ (Contributor) commented:

This is difficult to reproduce standalone, even while spamming etcd commands during node startup. Even removing the hostname local variable in the cloud environments didn't surface this race condition.

Going to try to reproduce with Rancher provisioning.

VestigeJ (Contributor) commented:

Environment Details

I tried reproducing this on k3s additionally but wasn't able to do so.

RANCHER_VERSIONS


Component | Version
-- | --
Rancher | v2.8-38d8ceb273be25ab304ed8c4144beda63bbcc27a-head
Dashboard | release-2.8 832a24819
Helm | v2.16.8-rancher2
Machine | v0.15.0-rancher110


Reproduced using VERSION=v1.25.16+rke2r1 deployed from latest rancher 2.8-head

Validated using same rancher instance with new KDM metadata config
VERSION=v1.26.15-rc1+rke2r1
VERSION=v1.27.12-rc1+rke2r1
VERSION=v1.28.8-rc1+rke2r1

One technicality for the main branch: v1.29 is an unsupported version, so I configured the cluster as if it were v1.28 with the cilium CNI, then edited the YAML to deploy the unsupported version for testing purposes.

Infrastructure

  • Cloud

Node(s) CPU architecture, OS, and version:

Linux 5.4.0-1041-aws x86_64 GNU/Linux
PRETTY_NAME="Ubuntu 20.04.2 LTS"

Cluster Configuration:

$ kgn -o wide

NAME                                     STATUS   ROLES                              AGE   VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
justjustoldeadold-pool1-a8712137-2swpg   Ready    control-plane,etcd,master,worker   27m   v1.25.16+rke2r1   1.1.1.13        1.1.1.13           Ubuntu 20.04.2 LTS   5.4.0-1041-aws   containerd://1.7.7-k3s1
justjustoldeadold-pool1-a8712137-5mmq5   Ready    control-plane,etcd,master,worker   27m   v1.25.16+rke2r1   2.2.2.227       2.2.2.2            Ubuntu 20.04.2 LTS   5.4.0-1041-aws   containerd://1.7.7-k3s1
justjustoldeadold-pool1-a8712137-s4kr6   Ready    control-plane,etcd,master,worker   31m   v1.25.16+rke2r1   2.2.2.38        3.3.3.2             Ubuntu 20.04.2 LTS   5.4.0-1041-aws   containerd://1.7.7-k3s1

Config.yaml:

Pretty default with Cilium CNI

sudo cat /etc/rancher/rke2/config.yaml.d/50-rancher.yaml 
{
  "advertise-address": "1.1.1.13",
  "agent-token": "rsfr4jvcvmd7688888888888888888888888882npztg",
  "cni": "cilium",
  "disable-kube-proxy": false,
  "etcd-expose-metrics": false,
  "etcd-snapshot-retention": 5,
  "etcd-snapshot-schedule-cron": "0 */5 * * *",
  "kube-controller-manager-arg": [
    "cert-dir=/var/lib/rancher/rke2/server/tls/kube-controller-manager",
    "secure-port=10257"
  ],
  "kube-controller-manager-extra-mount": [
    "/var/lib/rancher/rke2/server/tls/kube-controller-manager:/var/lib/rancher/rke2/server/tls/kube-controller-manager"
  ],
  "kube-scheduler-arg": [
    "cert-dir=/var/lib/rancher/rke2/server/tls/kube-scheduler",
    "secure-port=10259"
  ],
  "kube-scheduler-extra-mount": [
    "/var/lib/rancher/rke2/server/tls/kube-scheduler:/var/lib/rancher/rke2/server/tls/kube-scheduler"
  ],
  "node-external-ip": [
    "2.2.2.2"
  ],
  "node-ip": [
    "2.2.2.2"
  ],
  "node-label": [
    "rke.cattle.io/machine=37bfffc9-e128-46fe-b583-738b3e99b958"
  ],
  "private-registry": "/etc/rancher/rke2/registries.yaml",
  "protect-kernel-defaults": false,
  "tls-san": [
    "1.1.1.130"
  ],
  "token": "xzb88888888888888888888888888888888889n4jrs9"
}
Steps

$ curl https://get.rke2.io --output install-"rke2".sh
$ sudo chmod +x install-"rke2".sh
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd
$ sudo modprobe ip_vs_rr
$ sudo modprobe ip_vs_wrr
$ sudo modprobe ip_vs_sh
$ sudo printf "vm.panic_on_oom=0 \nvm.overcommit_memory=1 \nkernel.panic=10 \nkernel.panic_on_oops=1 \n" > ~/60-rke2-cis.conf
$ sudo cp 60-rke2-cis.conf /etc/sysctl.d/
$ sudo systemctl restart systemd-sysctl
$ get_etcdctl //pasted below
$ get_etcd //pasted below
$ rke2 -v

Results:

$ get_etcd
+------------------+---------+-----------+------------------------+------------------------+------------+
|        ID        | STATUS  |   NAME    |       PEER ADDRS       |      CLIENT ADDRS      | IS LEARNER |
+------------------+---------+-----------+------------------------+------------------------+------------+
| 13a8cf8833d6853f | started | -7bd03d2b | https://1.1.1.13:2380  | https://1.1.1.13:2379  |      false |
| f695e96c242b1924 | started | -38cc4f29 | https://2.2.2.38:2380  | https://2.2.2.38:2379  |      false |
| f80f8045ed175c9b | started | -8c15b6d4 | https://2.2.2.227:2380 | https://2.2.2.227:2379 |      false |
+------------------+---------+-----------+------------------------+------------------------+------------+
get_etcdctl() {
    has_bin curl
    # has_bin wget
    # wget https://github.com/etcd-io/etcd/releases/download/v3.5.0/etcd-v3.5.0-linux-amd64.tar.gz
    # wait
    # tar -xvzf etcd-v3.5.0-linux-amd64.tar.gz
    # sudo cp etcd-v3.5.0-linux-amd64/etcd* /usr/local/bin/
    _etcd_version=v3.5.0
    # choose either URL
    # GOOGLE_URL=https://storage.googleapis.com/etcd
    _github_url=https://github.com/etcd-io/etcd/releases/download
    _download_url=${_github_url}

    rm -f /tmp/etcd-${_etcd_version}-linux-amd64.tar.gz
    rm -rf /tmp/etcd-download-test && mkdir -p /tmp/etcd-download-test
    curl -L ${_download_url}/${_etcd_version}/etcd-${_etcd_version}-linux-amd64.tar.gz -o /tmp/etcd-${_etcd_version}-linux-amd64.tar.gz
    tar xzvf /tmp/etcd-${_etcd_version}-linux-amd64.tar.gz -C /tmp/etcd-download-test --strip-components=1
    rm -f /tmp/etcd-${_etcd_version}-linux-amd64.tar.gz
    /tmp/etcd-download-test/etcd --version
    /tmp/etcd-download-test/etcdctl version
    /tmp/etcd-download-test/etcdutl version
    
}
get_etcd() {
    _product="${1:-rke2}"
    has_bin etcdctl
     sudo ETCDCTL_API=3 etcdctl \
    --cert /var/lib/rancher/"${_product}"/server/tls/etcd/server-client.crt \
    --key /var/lib/rancher/"${_product}"/server/tls/etcd/server-client.key \
    --endpoints https://127.0.0.1:2379 \
    --cacert /var/lib/rancher/"${_product}"/server/tls/etcd/server-ca.crt \
    member list -w table
    ##   https://etcd.io/docs/v3.5/tutorials/how-to-deal-with-membership/
}

$ get_etcd

+------------------+---------+-----------------------------------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |                     NAME                      |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------------------------------------------+---------------------------+---------------------------+------------+
| 1720f870bfeaa43a | started | justjustrke2129-pool1-37307acf-kc5w7-7037e435 | https://1.1.1.109:2380    | https://1.1.1.109:2379    |      false |
+------------------+---------+-----------------------------------------------+---------------------------+---------------------------+------------+

$ rke2 -v

rke2 version v1.29.3-rc1+rke2r1 (0a03dad7063899effa8e2656cb74d1b1d51b103a)
go version go1.21.8 X:boringcrypto
