Duplicate endpoints prevents any endpoint from being used #5577

dmessano · 2024-03-07T21:32:17Z

Environmental Info:
RKE2 Version:
rke2 version v1.27.10+rke2r1 (915672b)
go version go1.20.13 X:boringcrypto

Node(s) CPU architecture, OS, and Version:
Linux node01 4.18.0-477.10.1.el8_8.x86_64 #1 SMP Wed Apr 5 13:35:01 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.8 (Ootpa)

Cluster Configuration:
9 Cluster Members - 3 Server 6 Worker

Describe the bug:
Listing the same private registry endpoint twice under a mirror causes an error, and no private mirror is used; only the fallback endpoint is tried.

Steps To Reproduce:

Install RHEL 8.8
Disable SELinux
Disable firewalld
Create rke2-canal.conf file for NetworkManager
Install RKE2 via install.sh with rke2.linux-amd64.tar.gz, rke2-images.linux-amd64.tar.zst, sha256sum-amd64.txt, on a closed network (no internet access, local only)
Set /etc/rancher/rke2/registries.yaml

mirrors:
 docker.io:
   endpoint:
     - "harbor.main.company.com"
     - "harbor.old.company.com"
     - "harbor.dev.company.com"
     - "harbor.old.company.com"
 quay.io:
   endpoint:
     - "harbor.main.company.com"
     - "harbor.old.company.com"
     - "harbor.dev.company.com"
     - "harbor.old.company.com"
 k8s.gcr.io:
   endpoint:
     - "harbor.main.company.com"
     - "harbor.old.company.com"
     - "harbor.dev.company.com"
     - "harbor.old.company.com"
configs:
 "docker.io":
   tls:
     insecure_skip_verify: true
 "quay.io":
   tls:
     insecure_skip_verify: true
 "k8s.gcr.io":
   tls:
     insecure_skip_verify: true

systemctl stop rke2-server
systemctl start rke2-server
Deploy something that needs to pull an image and wait for ImagePullBackOff

Check the logs; there is an error caused by the duplicate endpoints:
*level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (24, 2): duplicated tables"*

Expected behavior:
Try the duplicate endpoint or skip over it
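
The "skip over it" option amounts to dropping repeated entries while keeping the first occurrence and the original order. A minimal shell sketch of that idea, assuming one endpoint per line on standard input; `dedupe_endpoints` is a hypothetical helper name and not part of RKE2, and this is not a claim about how the eventual fix is implemented:

```shell
# dedupe_endpoints (hypothetical helper): read one endpoint per line on stdin
# and print each distinct endpoint once, in first-seen order.
dedupe_endpoints() {
  awk '!seen[$0]++'
}

# Example with the endpoint list from this report.
printf '%s\n' \
  harbor.main.company.com \
  harbor.old.company.com \
  harbor.dev.company.com \
  harbor.old.company.com | dedupe_endpoints
```

The example prints the three distinct endpoints in their original order, with the second harbor.old.company.com dropped.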

Actual behavior:
Only the default fallback endpoint is tried; none of the private registries are tried, resulting in ImagePullBackOff for all workloads.
We left a cluster in this state for three hours. After that we deleted the pods many times and got exactly the same result.
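
On an affected version, one way to spot the problem is to scan registries.yaml for endpoints repeated under the same mirror. The following is a rough sketch, assuming the simple block-style layout used in the file above (one `- endpoint` per line under `mirrors:`, and no per-mirror keys other than `endpoint:`). It is not a YAML parser, and `find_dup_endpoints` is a hypothetical name:

```shell
# find_dup_endpoints (hypothetical helper): print "<mirror> <endpoint>" once for
# every endpoint that is listed more than once under the same mirror.
find_dup_endpoints() {
  awk '
    /^mirrors:/ { in_mirrors = 1; next }
    /^[^ \t#]/  { in_mirrors = 0 }   # any other top-level key ends the mirrors section
    !in_mirrors { next }
    # An indented "name:" line other than "endpoint:" is taken as the mirror name.
    /^[ \t]+[^ \t-].*:[ \t]*$/ && $1 != "endpoint:" {
      mirror = $1
      sub(/:$/, "", mirror)
      gsub(/"/, "", mirror)
      next
    }
    # An indented "- value" line is an endpoint of the current mirror.
    /^[ \t]+-[ \t]/ {
      ep = $2
      gsub(/"/, "", ep)
      if (seen[mirror, ep]++ == 1) print mirror, ep
    }
  ' "$1"
}
```

Running `find_dup_endpoints /etc/rancher/rke2/registries.yaml` against the file in this report should name harbor.old.company.com once for each of the three mirrors. The same endpoint appearing under different mirrors is not reported, since each mirror gets its own hosts.toml.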

Additional context / logs:

time="2024-03-07T18:45:39.875620419Z" level=info msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\""
time="2024-03-07T18:45:39.875970645Z" level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (24, 2): duplicated tables"                                                                                              
time="2024-03-07T18:45:43.878647091Z" level=info msg="trying next host" error="failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving" host=ghcr.io
time="2024-03-07T18:45:43.878807261Z" level=error msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\" failed" error="failed to pull and unpack image \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to resolve reference \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving"
time="2024-03-07T18:45:43.878853331Z" level=info msg="stop pulling image ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7: active requests=0, bytes read=0"                                                                                
time="2024-03-07T18:45:57.940686660Z" level=info msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\""
time="2024-03-07T18:45:57.941043822Z" level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (24, 2): duplicated tables"                                                                                              
time="2024-03-07T18:46:01.943919498Z" level=info msg="trying next host" error="failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving" host=ghcr.io
time="2024-03-07T18:46:01.944077639Z" level=error msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\" failed" error="failed to pull and unpack image \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to resolve reference \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving"
time="2024-03-07T18:46:01.944119873Z" level=info msg="stop pulling image ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7: active requests=0, bytes read=0"                                                                                
time="2024-03-07T18:46:25.939879033Z" level=info msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\""
time="2024-03-07T18:46:25.940314528Z" level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (24, 2): duplicated tables"                                                                                              
time="2024-03-07T18:46:29.943185882Z" level=info msg="trying next host" error="failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving" host=ghcr.io
time="2024-03-07T18:46:29.943347620Z" level=error msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\" failed" error="failed to pull and unpack image \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to resolve reference \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving"
time="2024-03-07T18:46:29.943390961Z" level=info msg="stop pulling image ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7: active requests=0, bytes read=0"                                                                                
time="2024-03-07T18:47:22.939896237Z" level=info msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\""
time="2024-03-07T18:47:22.940262036Z" level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (24, 2): duplicated tables"                                                                                              
time="2024-03-07T18:47:26.943287957Z" level=info msg="trying next host" error="failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving" host=ghcr.io
time="2024-03-07T18:47:26.943482751Z" level=error msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\" failed" error="failed to pull and unpack image \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to resolve reference \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving"
time="2024-03-07T18:47:26.943526226Z" level=info msg="stop pulling image ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7: active requests=0, bytes read=0"

.....


brandond · 2024-03-07T23:47:49Z

Tracked in k3s as k3s-io/k3s#9693

aganesh-suse · 2024-03-15T21:03:49Z

Validated on master branch with commit `109f70b`

Environment Details

Infrastructure

Cloud
Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Cluster Configuration:

HA : 3 server / 1 agent

or

1 server/ 1 agent

Config.yaml:

token: xxxx
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1
debug: true

Additional files

registries.yaml

mirrors:
  docker.io:
    endpoint:
      - https://registry.example.com
      - https://registry.example.com

Testing Steps

Copy config.yaml

$ sudo mkdir -p /etc/rancher/rke2 && sudo cp config.yaml /etc/rancher/rke2

Install RKE2

curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_COMMIT='109f70b74598feac41aeadf45a8144d363e3c050' INSTALL_RKE2_TYPE='server' INSTALL_RKE2_METHOD=tar sh -

Start the RKE2 service

$ sudo systemctl enable --now rke2-server
or 
$ sudo systemctl enable --now rke2-agent

Verify Cluster Status:

kubectl get nodes -o wide
kubectl get pods -A

Check hosts.toml contents:

sudo cat /var/lib/rancher/rke2/agent/etc/containerd/certs.d/docker.io/hosts.toml

Check containerd logs for errors:

sudo cat /var/lib/rancher/rke2/agent/containerd/containerd.log | grep error | grep TOML  # on server1

Replication Results:

rke2 version used for replication:

$ rke2 -v
rke2 version v1.29.2+rke2r1 (08699dfffdf75a61a5e6064f9f8efe8ddae857fe)
go version go1.21.7 X:boringcrypto

$ sudo cat /var/lib/rancher/rke2/agent/etc/containerd/certs.d/docker.io/hosts.toml
# File generated by rke2. DO NOT EDIT.
server = "https://registry-1.docker.io/v2"

[host."https://registry.example.com/v2"]
  capabilities = ["pull", "resolve"]
  ca = ["/home/ubuntu/ca.pem"]

[host."https://registry.example.com/v2"]
  capabilities = ["pull", "resolve"]
  ca = ["/home/ubuntu/ca.pem"]

$ sudo cat /var/lib/rancher/rke2/agent/containerd/containerd.log | grep error | grep TOML  # on server1
time="2024-03-15T02:15:17.509832797Z" level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (8, 2): duplicated tables"
time="2024-03-15T02:15:17.660226037Z" level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (8, 2): duplicated tables"
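
The same condition can be checked directly in the generated file: the "duplicated tables" error at (8, 2) appears to point at the second `[host."..."]` header, which repeats the first. A small sketch, assuming each header sits at the start of a line as in the output above; `dup_host_tables` is a hypothetical name:

```shell
# dup_host_tables (hypothetical helper): print every [host."..."] table header
# that appears more than once in the given hosts.toml file.
dup_host_tables() {
  grep -oE '^\[host\."[^"]*"\]' "$1" | sort | uniq -d
}
```

It could be run as `dup_host_tables /var/lib/rancher/rke2/agent/etc/containerd/certs.d/docker.io/hosts.toml` (with sudo if the file is not readable). An empty result means no host table is repeated.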

Validation Results:

rke2 version used for validation:

$ rke2 -v
rke2 version v1.29.2+dev.109f70b7 (109f70b74598feac41aeadf45a8144d363e3c050)
go version go1.21.7 X:boringcrypto

$ sudo cat /var/lib/rancher/rke2/agent/etc/containerd/certs.d/docker.io/hosts.toml
# File generated by rke2. DO NOT EDIT.

server = "https://registry-1.docker.io/v2"
capabilities = ["pull", "resolve", "push"]



[host."https://registry.example.com/v2"]
  capabilities = ["pull", "resolve"]

$ sudo cat /var/lib/rancher/rke2/agent/containerd/containerd.log | grep error | grep TOML  # on server1

brandond changed the title ~~Duplicate endpoints prevents any enpoint from being used~~ Duplicate endpoints prevents any endpoint from being used Mar 7, 2024

brandond self-assigned this Mar 7, 2024

brandond added this to the v1.29.3+rke2r1 milestone Mar 7, 2024

endawkins assigned aganesh-suse Mar 14, 2024

aganesh-suse closed this as completed Mar 15, 2024
