
RKE2 Server listening on ipv6, but not ipv4 #4777

Closed
mhegreberg opened this issue Sep 20, 2023 · 11 comments

@mhegreberg

Environmental Info:
RKE2 Version:
rke2 version v1.26.9+rke2r1 (368ba42666c9664d58bd0a9f7d3d13cd38f6267d) go version go1.20.8 X:boringcrypto

Node(s) CPU architecture, OS, and Version:
Linux Rancher1 6.1.0-12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07) x86_64 GNU/Linux

Cluster Configuration:
Trying to register a second server node, after initializing the first

Describe the bug:
The first server node (Rancher1) is listening for new node registration on IPv6 only, not IPv4.

When I try to register the second node, I get:
Sep 20 09:03:25 Rancher2 rke2[16709]: time="2023-09-20T09:03:25-07:00" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: Get \"https://RANCHER.SERVER.URL:9345/cacerts\": dial tcp ip.address:9345: connect: no route to host"

I verified with netstat that the first node is only listening on IPv6:

root@Rancher1:/home/debian# netstat -tunlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      638/sshd: /usr/sbin
tcp        0      0 ip.address:2380         0.0.0.0:*               LISTEN      28317/etcd
tcp        0      0 ip.address:2379         0.0.0.0:*               LISTEN      28317/etcd
tcp        0      0 127.0.0.1:10249         0.0.0.0:*               LISTEN      28724/kube-proxy
tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN      28196/kubelet
tcp        0      0 127.0.0.1:10259         0.0.0.0:*               LISTEN      28546/kube-schedule
tcp        0      0 127.0.0.1:10258         0.0.0.0:*               LISTEN      28635/cloud-control
tcp        0      0 127.0.0.1:10257         0.0.0.0:*               LISTEN      28530/kube-controll
tcp        0      0 127.0.0.1:10256         0.0.0.0:*               LISTEN      28724/kube-proxy
tcp        0      0 127.0.0.1:2379          0.0.0.0:*               LISTEN      28317/etcd
tcp        0      0 127.0.0.1:2381          0.0.0.0:*               LISTEN      28317/etcd
tcp        0      0 127.0.0.1:2380          0.0.0.0:*               LISTEN      28317/etcd
tcp        0      0 127.0.0.1:10010         0.0.0.0:*               LISTEN      27802/containerd
tcp        0      0 172.18.73.79:9099       0.0.0.0:*               LISTEN      30570/calico-node
tcp6       0      0 :::22                   :::*                    LISTEN      638/sshd: /usr/sbin
tcp6       0      0 :::9091                 :::*                    LISTEN      30570/calico-node
tcp6       0      0 :::9345                 :::*                    LISTEN      27791/rke2 server
tcp6       0      0 :::10250                :::*                    LISTEN      28196/kubelet
tcp6       0      0 :::6443                 :::*                    LISTEN      28397/kube-apiserve
udp        0      0 0.0.0.0:8472            0.0.0.0:*                           -

Trying to curl the cacerts via IPv4 manually fails, but if I run curl -k https://localhost:9345/cacerts from the first server, I get the cert back, since localhost resolves to the IPv6 loopback there.
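
For reference, a minimal sketch of how the IPv4 path can be tested explicitly from the first server (the second address is a placeholder for rancher1's real IPv4 address):

# 127.0.0.1 forces an IPv4 connection, so if either of these returns the CA bundle,
# the :::9345 listener is in fact accepting IPv4 as well.
curl -vks https://127.0.0.1:9345/cacerts
curl -vks https://172.ipv4.of.rancher1:9345/cacerts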

Steps To Reproduce:

  • Installed RKE2 on one node
  • Added the token to the config and applied the config to the other node
  • Tried to install RKE2 on the second node

Expected behavior:
I would expect to be able to use IPv4 to register new nodes, as there is no IPv6 network between these servers.

Actual behavior:
It seems I can only register new nodes via IPv6.

Additional context / logs:
It's entirely possible I'm missing some configuration setting and RKE2 is defaulting to IPv6 only, but I see nothing about this in either the Rancher or the RKE2 docs.

I've been using these as reference:

https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/kubernetes-cluster-setup/rke2-for-rancher

https://docs.rke2.io/install/ha

@manuelbuil
Contributor

Thanks for creating the issue! Could you share your config for the server? Did you set node-ip or advertise-address or any other network config?
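
For reference, a minimal sketch of what pinning those settings would look like, appended to /etc/rancher/rke2/config.yaml on the server (the address is a placeholder, not a value taken from this cluster):

# Hypothetical example only: force the server to use a specific IPv4 address
# for its node IP and the address it advertises to other cluster members.
cat <<'EOF' >> /etc/rancher/rke2/config.yaml
node-ip: 172.ipv4.of.rancher1
advertise-address: 172.ipv4.of.rancher1
EOF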

@brandond
Member

brandond commented Sep 20, 2023

netstat shows :::9345 for dual-stack listeners when the port is bound to the wildcard address. A tcp6 entry like this does not mean that it is only listening on IPv6.

failed to get CA certs: Get \"https://RANCHER.SERVER.URL:9345/cacerts\": dial tcp ip.address:9345: connect: no route to host"

Is ip.address in your sanitized output the IPv4 or the IPv6 address? It is very hard to tell what's actually going on from highly redacted logs.

Please confirm that you have both A and AAAA records for RANCHER.SERVER.URL, or only A records if you want it to use IPv4. What is the output of dig RANCHER.SERVER.URL any on the node you're trying to join?
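
A lighter-weight check than dig ... any, assuming dig is installed on the joining node:

# An answer for A but not for AAAA means the name should resolve over IPv4 only.
dig +short A rancher.server.url
dig +short AAAA rancher.server.url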

@mhegreberg
Author

Hey!
Here's my config:

token: tokenvalue
server: https://rancher.server.url:9345
tls-san:
  - rancher.server.url
  - rancher
  - rancher1
  - 172.ipv4.of.rancher1
  - rancher2
  - 172.ipv4.of.rancher2
  - rancher3
  - 172.ipv4.of.rancher3

The Rancher server URL has only an A record, as I'm just using IPv4.

dig results:

root@Rancher2:/etc/rancher/rke2# dig rancher.server.url any

; <<>> DiG 9.18.16-1~deb12u1-Debian <<>> rancher.server.url any
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46436
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;rancher.server.url. IN      ANY

;; ANSWER SECTION:
rancher.server.url. 3600 IN  A       172.ipv4.of.rancher1

;; Query time: 0 msec
;; SERVER: 172.18.73.10#53(172.18.73.10) (TCP)
;; WHEN: Wed Sep 20 09:59:00 PDT 2023
;; MSG SIZE  rcvd: 74

@brandond
Member

brandond commented Sep 20, 2023

Can you confirm that you can curl -vks https://172.ipv4.of.rancher1:9345 from both that server, and the node you're trying to join?

Can you confirm that you've opened that port on any firewalls between the two nodes, and disabled any local firewall (firewalld/ufw) on that server?
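
A quick way to check both on the server, assuming a systemd-based distro (firewall-cmd exists only if firewalld is installed):

# Report whether either common host firewall is active on rancher1.
systemctl is-active ufw firewalld
# If firewalld is running, show what it currently allows.
firewall-cmd --list-all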

@mhegreberg
Author

From the first node (rancher1) I can run curl -vks https://rancher.server.url:9345/cacerts and I get back the certificate (same result with both the IP and the URL).

Curling https://172.ipv4.of.rancher1:9345 without a path also connects, but returns a 404.

From the node I'm trying to join, it fails:

*   Trying 172.ip.of.rancher1:9345...
* connect to 172.ip.of.rancher1 port 9345 failed: No route to host
* Failed to connect to rancher.server.url port 9345 after 2 ms: Couldn't connect to server
* Closing connection 0

All firewalls between the two nodes (local and network) are currently disabled; all traffic should be open.

@mhegreberg
Author

It's worth noting that I can curl https://rancher.server.url without the port from rancher2 (the node I'm trying to join) and get a 404, so I can reach the server, just not on port 9345.

@mhegreberg
Author

Alright, I found the issue. It turns out that even though ufw was disabled, firewalld was still running; I'm not sure why it was set up this way.
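
For anyone hitting the same thing, a rough sketch of the two ways to clear this, assuming firewalld is the only remaining host firewall:

# Option 1: stop and disable firewalld entirely.
systemctl disable --now firewalld
# Option 2: keep firewalld but open the supervisor port used for joining, then reload.
firewall-cmd --permanent --add-port=9345/tcp
firewall-cmd --reload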

I was able to get past that, but the new node (rancher2) is now stuck at:

Sep 20 10:50:01 Rancher2 rke2[21326]: time="2023-09-20T10:50:01-07:00" level=info msg="Waiting for etcd server to become available"
Sep 20 10:50:01 Rancher2 rke2[21326]: time="2023-09-20T10:50:01-07:00" level=info msg="Waiting for API server to become available"
Sep 20 10:50:05 Rancher2 rke2[21326]: time="2023-09-20T10:50:05-07:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"

Is this usually a long process? The first node only took a minute or so, and this has been looping for a while.

@brandond
Member

brandond commented Sep 20, 2023

It appears to be waiting on etcd to start. Check the etcd pod logs under /var/log/pods. Can you confirm that you've also opened all the etcd ports between the nodes? https://docs.rke2.io/install/requirements#inbound-network-rules
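
A sketch of both checks, assuming firewalld is the firewall in use; the pod directory name will differ per node, and the port list is abbreviated from the requirements page linked above:

# etcd runs as a static pod, so its logs live under /var/log/pods on server nodes.
ls /var/log/pods/ | grep etcd
tail -n 50 /var/log/pods/kube-system_etcd-*/etcd/*.log
# Core node-to-node ports from the inbound rules page: 9345 (supervisor), 6443 (apiserver),
# 2379-2381 (etcd), 10250 (kubelet), 8472/udp (Canal VXLAN).
firewall-cmd --permanent --add-port=9345/tcp --add-port=6443/tcp --add-port=10250/tcp \
  --add-port=2379-2381/tcp --add-port=8472/udp
firewall-cmd --reload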

@mhegreberg
Author

I manually added all those rules to both nodes.

The etcd log shows:

2023-09-20T11:25:59.846642269-07:00 stderr F {"level":"warn","ts":"2023-09-20T18:25:59.846508Z","caller":"etcdserver/server.go:2085","msg":"failed to publish local member to cluster through raft","local-member-id":"5b7a12c7d16c0ba4","local-member-attributes":"{Name:rancher2-067cd654 ClientURLs:[https://172.ipv4.of.rancher2:2379]}","request-path":"/0/members/5b7a12c7d16c0ba4/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}
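
That "request timed out" usually means the new member cannot reach the existing member's etcd ports. A minimal reachability check from rancher2, assuming nc is available (addresses are placeholders):

# Only the TCP connection matters here; "succeeded" means the port is reachable,
# while "No route to host" or a timeout points at a firewall.
nc -zv 172.ipv4.of.rancher1 2379
nc -zv 172.ipv4.of.rancher1 2380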

@brandond
Member

brandond commented Sep 20, 2023

On rancher2, try running rke2-killall.sh, then restarting the rke2-server service.

If that doesn't fix it, I would probably use kubectl to delete the rancher2 node from the cluster, then uninstall and reinstall RKE2 on rancher2, after confirming that you've opened up the firewall on both nodes to all the ports and ranges listed on that page, or (preferably) disabled it entirely.
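
A sketch of that sequence; rke2-killall.sh and rke2-uninstall.sh are installed alongside RKE2, and the node name below is an assumption, so check kubectl get nodes first:

# On rancher2: stop all RKE2 processes and retry the join.
rke2-killall.sh
systemctl restart rke2-server
# If it still hangs: from rancher1, remove the half-joined node from the cluster,
# then uninstall RKE2 on rancher2 and reinstall once the ports are confirmed open.
kubectl delete node rancher2
rke2-uninstall.sh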

@brandond
Member

I'm moving this to a discussion instead of an issue, as it has become clear that there were just some missed prerequisites and there is nothing wrong with RKE2 itself.

@rancher rancher locked and limited conversation to collaborators Sep 20, 2023
@brandond brandond converted this issue into discussion #4778 Sep 20, 2023

