Duplicate endpoints prevents any endpoint from being used #5577

Closed
dmessano opened this issue Mar 7, 2024 · 2 comments

dmessano commented Mar 7, 2024

Environmental Info:
RKE2 Version:
rke2 version v1.27.10+rke2r1 (915672b)
go version go1.20.13 X:boringcrypto

Node(s) CPU architecture, OS, and Version:
Linux node01 4.18.0-477.10.1.el8_8.x86_64 #1 SMP Wed Apr 5 13:35:01 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.8 (Ootpa)

Cluster Configuration:
9 Cluster Members - 3 Server 6 Worker

Describe the bug:
Listing the same private registry endpoint more than once for a mirror causes a parse error, and no private mirror is used at all; only the default fallback endpoint is tried.

Steps To Reproduce:

  • Install RHEL 8.8
  • Disable SELinux
  • Disable firewalld
  • Create the rke2-canal.conf file for NetworkManager
  • Install RKE2 via install.sh with rke2.linux-amd64.tar.gz, rke2-images.linux-amd64.tar.zst, and sha256sum-amd64.txt on a closed network (local only, no internet access)
  • Set /etc/rancher/rke2/registries.yaml
mirrors:
 docker.io:
   endpoint:
     - "harbor.main.company.com"
     - "harbor.old.company.com"
     - "harbor.dev.company.com"
     - "harbor.old.company.com"
 quay.io:
   endpoint:
     - "harbor.main.company.com"
     - "harbor.old.company.com"
     - "harbor.dev.company.com"
     - "harbor.old.company.com"
 k8s.gcr.io:
   endpoint:
     - "harbor.main.company.com"
     - "harbor.old.company.com"
     - "harbor.dev.company.com"
     - "harbor.old.company.com"
configs:
 "docker.io":
   tls:
     insecure_skip_verify: true
 "quay.io":
   tls:
     insecure_skip_verify: true
 "k8s.gcr.io":
   tls:
     insecure_skip_verify: true
  • systemctl stop rke2-server
  • systemctl start rke2-server
  • Deploy something that needs to pull an image and wait for ImagePullBackOff

Check the logs; there is an error caused by the duplicate endpoints:
level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (24, 2): duplicated tables"

Expected behavior:
Either try the duplicate endpoint again or skip over it; the other endpoints should still be used.
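
To illustrate the "skip over it" option, here is a minimal sketch of order-preserving de-duplication applied to one mirror's endpoint list before any hosts.toml is generated. This is not the fix that landed upstream; the function name `dedupe_endpoints` and the choice to ignore surrounding whitespace and a trailing slash when comparing are assumptions made for the example.

```python
def dedupe_endpoints(endpoints):
    """Return endpoints with later repeats dropped, keeping first-seen order."""
    seen = set()
    unique = []
    for endpoint in endpoints:
        # Assumption: surrounding whitespace and a trailing slash should not
        # make two entries count as different endpoints.
        cleaned = endpoint.strip()
        key = cleaned.rstrip("/")
        if key in seen:
            continue  # skip the repeat instead of emitting a second table
        seen.add(key)
        unique.append(cleaned)
    return unique


# The docker.io endpoint list from the registries.yaml above.
endpoints = [
    "harbor.main.company.com",
    "harbor.old.company.com",
    "harbor.dev.company.com",
    "harbor.old.company.com",
]
print(dedupe_endpoints(endpoints))
```

Keeping the first occurrence preserves the priority order of the mirrors, so the remaining endpoints would still be tried in the order they were listed.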

Actual behavior:
Only the default fallback endpoint is processed; none of the private registries are tried, resulting in ImagePullBackOff for all workloads.
We left a cluster in this state for three hours. After that we deleted the pods many times and got exactly the same result each time.

Additional context / logs:

time="2024-03-07T18:45:39.875620419Z" level=info msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\""
time="2024-03-07T18:45:39.875970645Z" level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (24, 2): duplicated tables"                                                                                              
time="2024-03-07T18:45:43.878647091Z" level=info msg="trying next host" error="failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving" host=ghcr.io
time="2024-03-07T18:45:43.878807261Z" level=error msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\" failed" error="failed to pull and unpack image \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to resolve reference \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving"
time="2024-03-07T18:45:43.878853331Z" level=info msg="stop pulling image ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7: active requests=0, bytes read=0"                                                                                
time="2024-03-07T18:45:57.940686660Z" level=info msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\""
time="2024-03-07T18:45:57.941043822Z" level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (24, 2): duplicated tables"                                                                                              
time="2024-03-07T18:46:01.943919498Z" level=info msg="trying next host" error="failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving" host=ghcr.io
time="2024-03-07T18:46:01.944077639Z" level=error msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\" failed" error="failed to pull and unpack image \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to resolve reference \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving"
time="2024-03-07T18:46:01.944119873Z" level=info msg="stop pulling image ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7: active requests=0, bytes read=0"                                                                                
time="2024-03-07T18:46:25.939879033Z" level=info msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\""
time="2024-03-07T18:46:25.940314528Z" level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (24, 2): duplicated tables"                                                                                              
time="2024-03-07T18:46:29.943185882Z" level=info msg="trying next host" error="failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving" host=ghcr.io
time="2024-03-07T18:46:29.943347620Z" level=error msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\" failed" error="failed to pull and unpack image \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to resolve reference \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving"
time="2024-03-07T18:46:29.943390961Z" level=info msg="stop pulling image ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7: active requests=0, bytes read=0"                                                                                
time="2024-03-07T18:47:22.939896237Z" level=info msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\""
time="2024-03-07T18:47:22.940262036Z" level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (24, 2): duplicated tables"                                                                                              
time="2024-03-07T18:47:26.943287957Z" level=info msg="trying next host" error="failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving" host=ghcr.io
time="2024-03-07T18:47:26.943482751Z" level=error msg="PullImage \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\" failed" error="failed to pull and unpack image \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to resolve reference \"ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7\": failed to do request: Head \"https://ghcr.io/v2/kube-vip/kube-vip-cloud-provider/manifests/v0.0.7\": dial tcp: lookup ghcr.io on 10.10.7.5:53: server misbehaving"
time="2024-03-07T18:47:26.943526226Z" level=info msg="stop pulling image ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.7: active requests=0, bytes read=0" 

.....

brandond changed the title from "Duplicate endpoints prevents any enpoint from being used" to "Duplicate endpoints prevents any endpoint from being used" on Mar 7, 2024

brandond commented Mar 7, 2024

Tracked in k3s as k3s-io/k3s#9693

@aganesh-suse

Validated on master branch with commit 109f70b

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Cluster Configuration:

HA : 3 server / 1 agent

or

1 server/ 1 agent

Config.yaml:

token: xxxx
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1
debug: true

Additional files

registries.yaml

mirrors:
  docker.io:
    endpoint:
      - https://registry.example.com
      - https://registry.example.com 

Testing Steps

  1. Copy config.yaml
$ sudo mkdir -p /etc/rancher/rke2 && sudo cp config.yaml /etc/rancher/rke2
  2. Install RKE2
curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_COMMIT='109f70b74598feac41aeadf45a8144d363e3c050' INSTALL_RKE2_TYPE='server' INSTALL_RKE2_METHOD=tar sh -
  3. Start the RKE2 service
$ sudo systemctl enable --now rke2-server
or
$ sudo systemctl enable --now rke2-agent
  4. Verify Cluster Status:
kubectl get nodes -o wide
kubectl get pods -A
  5. Check hosts.toml contents:
sudo cat /var/lib/rancher/rke2/agent/etc/containerd/certs.d/docker.io/hosts.toml

Check containerd logs for errors:

sudo cat /var/lib/rancher/rke2/agent/containerd/containerd.log | grep error | grep TOML on server1

Replication Results:

  • rke2 version used for replication:
$ rke2 -v
rke2 version v1.29.2+rke2r1 (08699dfffdf75a61a5e6064f9f8efe8ddae857fe)
go version go1.21.7 X:boringcrypto
$ sudo cat /var/lib/rancher/rke2/agent/etc/containerd/certs.d/docker.io/hosts.toml
# File generated by rke2. DO NOT EDIT.
server = "https://registry-1.docker.io/v2"

[host."https://registry.example.com/v2"]
  capabilities = ["pull", "resolve"]
  ca = ["/home/ubuntu/ca.pem"]

[host."https://registry.example.com/v2"]
  capabilities = ["pull", "resolve"]
  ca = ["/home/ubuntu/ca.pem"]
$ sudo cat /var/lib/rancher/rke2/agent/containerd/containerd.log | grep error | grep TOML on server1
time="2024-03-15T02:15:17.509832797Z" level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (8, 2): duplicated tables"
time="2024-03-15T02:15:17.660226037Z" level=error msg="failed to decode hosts.toml" error="failed to parse TOML: (8, 2): duplicated tables"
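
The "(8, 2)" position in the error matches line 8 of the generated file above, where the second `[host."https://registry.example.com/v2"]` header repeats a table that was already declared. As a rough way to spot this condition in a generated hosts.toml, here is a minimal sketch that scans for repeated table headers with a regular expression. It is not a TOML parser, and the helper name `duplicated_tables` is hypothetical.

```python
import re
from collections import Counter

# Matches a plain table header such as [host."https://example.com/v2"].
# Array-of-tables headers ([[...]]) are deliberately not matched.
TABLE_HEADER = re.compile(r"^\s*\[([^\[\]]+)\]\s*(?:#.*)?$")


def duplicated_tables(toml_text):
    """Return the table names that are declared more than once."""
    counts = Counter()
    for line in toml_text.splitlines():
        match = TABLE_HEADER.match(line)
        if match:
            counts[match.group(1).strip()] += 1
    return [name for name, count in counts.items() if count > 1]


# Same shape as the hosts.toml from the replication results.
hosts_toml = '''\
# File generated by rke2. DO NOT EDIT.
server = "https://registry-1.docker.io/v2"

[host."https://registry.example.com/v2"]
  capabilities = ["pull", "resolve"]
  ca = ["/home/ubuntu/ca.pem"]

[host."https://registry.example.com/v2"]
  capabilities = ["pull", "resolve"]
  ca = ["/home/ubuntu/ca.pem"]
'''
print(duplicated_tables(hosts_toml))
```

An empty list means no table header is repeated, which is the state the validated build produces.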

Validation Results:

  • rke2 version used for validation:
$ rke2 -v
rke2 version v1.29.2+dev.109f70b7 (109f70b74598feac41aeadf45a8144d363e3c050)
go version go1.21.7 X:boringcrypto
$ sudo cat /var/lib/rancher/rke2/agent/etc/containerd/certs.d/docker.io/hosts.toml
# File generated by rke2. DO NOT EDIT.

server = "https://registry-1.docker.io/v2"
capabilities = ["pull", "resolve", "push"]



[host."https://registry.example.com/v2"]
  capabilities = ["pull", "resolve"]
$ sudo cat /var/lib/rancher/rke2/agent/containerd/containerd.log | grep error | grep TOML on server1
