Core DNS missing NodeHosts key in Configmap #9274
The NodeHosts key is maintained by a controller that runs in the k3s server process. It ensures that the key exists and contains a hosts file entry for every node in the cluster. Confirm that you see nodes listed in If you don't find anything useful, please attach the output of
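If it helps, a quick way to check both is the following (standard kubectl; assumes the default kubeconfig is in use):
# Print only the NodeHosts key from the coredns configmap; it should contain
# one "<ip> <hostname>" line per node in the cluster.
kubectl -n kube-system get configmap coredns -o jsonpath='{.data.NodeHosts}'
# Compare against the nodes the cluster actually knows about.
kubectl get nodes -o wide
|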
If k3s is uninstalled and installed a couple of times, the problem starts to appear and persists after that. In the journal logs I can see that
K3s nodes YAML:
apiVersion: v1
items:
- apiVersion: v1
kind: Node
metadata:
annotations:
alpha.kubernetes.io/provided-node-ip: 10.54.65.165
etcd.k3s.cattle.io/node-address: 10.54.65.165
etcd.k3s.cattle.io/node-name: vm-165.tests.lab.net-03b3f407
flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"3e:40:74:1b:49:7e"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.54.65.165
k3s.io/encryption-config-hash: start-f34e200ca9e32f01069b791f6654af685af5487c329e8935fc45eda80502e68d
k3s.io/external-ip: 10.54.65.165
k3s.io/hostname: vm-165.tests.lab.net
k3s.io/internal-ip: 10.54.65.165
k3s.io/node-args: '["server","--cluster-init","true","--disable","traefik","--default-local-storage-path","/var/lib/rancher/storage","--disable-helm-controller","true","--https-listen-port","6443","--kube-apiserver-arg","audit-log-path=/var/lib/rancher/k3s/server/logs/audit.log","--kube-apiserver-arg","audit-policy-file=/var/lib/rancher/k3s/server/audit.yaml","--kube-apiserver-arg","audit-log-maxage=30","--kube-apiserver-arg","audit-log-maxbackup=10","--kube-apiserver-arg","audit-log-maxsize=100","--kube-apiserver-arg","request-timeout=300s","--kube-apiserver-arg","service-account-lookup=true","--kube-apiserver-arg","anonymous-auth=false","--kube-cloud-controller-manager-arg","webhook-secure-port=0","--kube-controller-manager-arg","terminated-pod-gc-threshold=12500","--kube-controller-manager-arg","use-service-account-credentials=true","--kubelet-arg","container-log-max-files=4","--kubelet-arg","container-log-max-size=50Mi","--kubelet-arg","streaming-connection-idle-timeout=5m","--kubelet-arg","make-iptables-util-chains=true","--node-external-ip","10.54.65.165","--node-name","vm-165.tests.lab.net","--prefer-bundled-bin","true","--protect-kernel-defaults","true","--secrets-encryption","true","--tls-san","10.54.65.165","--write-kubeconfig-mode","0600"]'
k3s.io/node-config-hash: A34B3T7OUQMIDGJXFI3KAYFMLFUNHUG7VU5PZZMBY5NBN3M4YQHQ====
k3s.io/node-env: '{"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/28f7e87eba734b7f7731dc900e2c84e0e98ce869f3dcf57f65dc7bbb80e12e56","K3S_INTERNAL_CERTS_EXPIRATION_DAYS":"730","K3S_UPGRADE":"false"}'
node.alpha.kubernetes.io/ttl: "0"
volumes.kubernetes.io/controller-managed-attach-detach: "true"
finalizers:
- wrangler.cattle.io/node
- wrangler.cattle.io/managed-etcd-controller
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/instance-type: k3s
beta.kubernetes.io/os: linux
kubernetes.io/arch: amd64
kubernetes.io/hostname: vm-165.tests.lab.net
kubernetes.io/os: linux
node-role.kubernetes.io/control-plane: "true"
node-role.kubernetes.io/etcd: "true"
node-role.kubernetes.io/master: "true"
node.kubernetes.io/instance-type: k3s
name: vm-165.tests.lab.net
resourceVersion: "10767"
uid: 4f4463ab-3e4e-44dc-a684-7af8d16920f7
spec:
podCIDR: 10.42.0.0/24
podCIDRs:
- 10.42.0.0/24
providerID: k3s://vm-165.tests.lab.net
status:
addresses:
- address: 10.54.65.165
type: InternalIP
- address: 10.54.65.165
type: ExternalIP
- address: vm-165.tests.lab.net
type: Hostname
allocatable:
cpu: "16"
ephemeral-storage: "202891775022"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 32649356Ki
pods: "110"
capacity:
cpu: "16"
ephemeral-storage: 208564736Ki
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 32649356Ki
pods: "110"
conditions:
- lastHeartbeatTime: "2024-01-23T14:54:11Z"
lastTransitionTime: "2024-01-23T13:52:55Z"
message: kubelet has sufficient memory available
reason: KubeletHasSufficientMemory
status: "False"
type: MemoryPressure
- lastHeartbeatTime: "2024-01-23T14:54:11Z"
lastTransitionTime: "2024-01-23T13:52:55Z"
message: kubelet has no disk pressure
reason: KubeletHasNoDiskPressure
status: "False"
type: DiskPressure
- lastHeartbeatTime: "2024-01-23T14:54:11Z"
lastTransitionTime: "2024-01-23T13:52:55Z"
message: kubelet has sufficient PID available
reason: KubeletHasSufficientPID
status: "False"
type: PIDPressure
- lastHeartbeatTime: "2024-01-23T14:54:11Z"
lastTransitionTime: "2024-01-23T13:52:55Z"
message: kubelet is posting ready status
reason: KubeletReady
status: "True"
type: Ready
- lastHeartbeatTime: "2024-01-23T13:53:10Z"
lastTransitionTime: "2024-01-23T13:53:10Z"
message: Node is a voting member of the etcd cluster
reason: MemberNotLearner
status: "True"
type: EtcdIsVoter
daemonEndpoints:
kubeletEndpoint:
Port: 10250
images:
- names:
- docker.io/rancher/klipper-helm:v0.8.2-build20230815
sizeBytes: 256386482
- names:
- docker.io/rancher/mirrored-library-traefik:2.10.5
sizeBytes: 152810877
- names:
- docker.io/rancher/mirrored-metrics-server:v0.6.3
sizeBytes: 70293467
- names:
- docker.io/rancher/mirrored-coredns-coredns:1.10.1
sizeBytes: 53618774
- names:
- docker.io/rancher/local-path-provisioner:v0.0.24
sizeBytes: 40448904
- names:
- docker.io/rancher/klipper-lb:v0.4.4
sizeBytes: 12479235
- names:
- docker.io/rancher/mirrored-library-busybox:1.36.1
sizeBytes: 4494167
- names:
- docker.io/rancher/mirrored-pause:3.6
sizeBytes: 685866
nodeInfo:
architecture: amd64
bootID: 9b5cc646-1840-4e2b-a1b2-966a2cf6ee15
containerRuntimeVersion: containerd://1.7.11-k3s2
kernelVersion: 4.18.0-477.10.1.el8_8.x86_64
kubeProxyVersion: v1.28.5+k3s1
kubeletVersion: v1.28.5+k3s1
machineID: dbe31cb559074f648b289837ab12412f
operatingSystem: linux
osImage: Red Hat Enterprise Linux 8.8 (Ootpa)
systemUUID: dbe31cb5-5907-4f64-8b28-9837ab12412f
kind: List
metadata:
resourceVersion: "" K3s journald logs:
|
Please, please attach logs instead of pasting pages and pages into your comment.
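For example, something along these lines captures the full service log as a file you can attach (assuming the systemd unit is named k3s; use k3s-agent on agent-only nodes):
# Dump the complete k3s journal to a file for attachment.
journalctl -u k3s --no-pager > k3s-journal.log
|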
The coredns manifest includes the |
I see the same issue when reinstalling k3s on an existing node.
|
Node looks good
|
The manifest on disk is missing
|
The only relevant line in the logs
|
The manifest on disk isn't intended to have that key. The key is supposed to be added and modified by a controller that runs on the servers. However, that controller has logged errors complaining that the configmap itself does not exist. Are you able to view that configmap in the running cluster? |
In my setup the configmap is visible in the cluster, but the controller is still unable to fetch it.
$ kubectl -n kube-system get configmap coredns -o yaml
apiVersion: v1
data:
Corefile: |
.:53 {
errors
health
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
}
hosts /etc/coredns/NodeHosts {
ttl 60
reload 15s
fallthrough
}
prometheus :9153
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
import /etc/coredns/custom/*.override
}
import /etc/coredns/custom/*.server
kind: ConfigMap
metadata:
annotations:
objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yQwWrzMBCEX0Xs2fEf20nsX9BDybH02lMva2kdq1Z2g6SkBJN3L8IUCiVtbyNGOzvfzoAn90IhOmHQcKmgAIsJQc+wl0CD8wQaSr1t1PzKSilFIUiIix4JfRoXHQjtdZHTuafAlCgq488xUSi9wK2AybEFDXvhwR2e8QQFHCnh50ZkloTJCcf8lP6NTIqUyuCkNJiSp9LJP5czoLjryztTWB0uE2iYmvjFuVSFenJsHx6tFf41gvGY6Y0Eshz/9D2e0OSZfIJVvMZExwzusSf/I9SIcQQNvaG6a+r/XVdV7abBddPtsN9W66Eedi0N7aberM22zaHf6t0tcPsIAAD//8Ix+PfoAQAA
objectset.rio.cattle.io/id: ""
objectset.rio.cattle.io/owner-gvk: k3s.cattle.io/v1, Kind=Addon
objectset.rio.cattle.io/owner-name: coredns
objectset.rio.cattle.io/owner-namespace: kube-system
creationTimestamp: "2024-01-23T13:52:29Z"
labels:
objectset.rio.cattle.io/hash: bce283298811743a0386ab510f2f67ef74240c57
name: coredns
namespace: kube-system
resourceVersion: "234"
uid: d2882fe7-c7df-4ee7-a3b6-b6446ae9869b |
I'm unable to reproduce this. All of our CI tests also ensure that coredns comes up properly in order for any commit to be merged. Is there anything unique about this system? Can you duplicate this on a clean host? I will also note that you said you're running 1.29.0, but the logs show 1.28.5. Are you seeing the same on both versions? |
Yes, I see this on both versions 1.28 and 1.29. For me, it happens when trying to upgrade k3s:
repeat the same loop again on the same VM. If the VM is fresh, the installation goes through. But it is important for us to make sure that the upgrade to production will not have the same problem, hence I am trying to figure out the root cause. |
@safderali5 what process are you using to upgrade from 1.27? When you upgrade to 1.29, are you stepping through 1.28 first? Can you list the specific patch releases that you are upgrading from and to? Similarly, how are you uninstalling k3s? Can you confirm that the uninstall script is completing successfully and all files are removed from disk? |
@harrisonbc was your cluster in question also upgraded, and if so from what version? |
I will find out |
@safderali5 your The uninstall script only removes content from the default data-dir at
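A rough sanity check after uninstalling, assuming the default locations (a custom data-dir has to be cleaned up by hand):
# Both paths should be gone after k3s-uninstall.sh completes successfully.
ls -ld /var/lib/rancher/k3s /etc/rancher/k3s
|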
@brandond it's @harsimranmaan's nodes that are using a non-standard data dir. |
I am using the uninstall script generated by k3s, located at /usr/local/bin/k3s-uninstall.sh. |
I have done a fresh trial on a VM running k3s 1.28.5+k3s1. Steps include:
The problem is still the same: the coredns configmap is not updated with NodeHosts. |
You've not mentioned this at any point, but from your node yaml I see that you have a bunch of extra configuration and env vars. Several of the env vars are not used by K3s itself for anything, and I am unsure what you're trying to achieve by setting them. Can you provide the exact command you are running to install k3s, along with any config that you are placing in config.yaml, and any other config that you are passing to the core Kubernetes components, such as the audit policy file? Please also call out anything that you're deploying to the cluster that might restrict or otherwise interfere with access to resources, such as OPA Gatekeeper or Kyverno. Also, at one point you were using a custom data-dir on some of the nodes; is this still something you're doing when testing?
k3s.io/node-args:
- server
- --cluster-init
- "true"
- --disable
- traefik
- --default-local-storage-path
- /var/lib/rancher/storage
- --disable-helm-controller
- "true"
- --https-listen-port
- "6443"
- --kube-apiserver-arg
- audit-log-path=/var/lib/rancher/k3s/server/logs/audit.log
- --kube-apiserver-arg
- audit-policy-file=/var/lib/rancher/k3s/server/audit.yaml
- --kube-apiserver-arg
- audit-log-maxage=30
- --kube-apiserver-arg
- audit-log-maxbackup=10
- --kube-apiserver-arg
- audit-log-maxsize=100
- --kube-apiserver-arg
- request-timeout=300s
- --kube-apiserver-arg
- service-account-lookup=true
- --kube-apiserver-arg
- anonymous-auth=false
- --kube-cloud-controller-manager-arg
- webhook-secure-port=0
- --kube-controller-manager-arg
- terminated-pod-gc-threshold=12500
- --kube-controller-manager-arg
- use-service-account-credentials=true
- --kubelet-arg
- container-log-max-files=4
- --kubelet-arg
- container-log-max-size=50Mi
- --kubelet-arg
- streaming-connection-idle-timeout=5m
- --kubelet-arg
- make-iptables-util-chains=true
- --node-external-ip
- 10.54.65.165
- --node-name
- vm-165.tests.lab.net
- --prefer-bundled-bin
- "true"
- --protect-kernel-defaults
- "true"
- --secrets-encryption
- "true"
- --tls-san
- 10.54.65.165
- --write-kubeconfig-mode
- "0600"
k3s.io/node-env:
K3S_DATA_DIR: /var/lib/rancher/k3s/data/28f7e87eba734b7f7731dc900e2c84e0e98ce869f3dcf57f65dc7bbb80e12e56
K3S_INTERNAL_CERTS_EXPIRATION_DAYS: "730"
K3S_UPGRADE: "false" You are also setting status:
addresses:
- address: 10.54.65.165
type: InternalIP
- address: 10.54.65.165
type: ExternalIP
- address: vm-165.tests.lab.net
type: Hostname |
My installation method is:
export INSTALL_K3S_SELINUX_WARN=true
export INSTALL_K3S_SKIP_DOWNLOAD=true
export INSTALL_K3S_SKIP_SELINUX_RPM=true
./install.sh
The script install.sh is from the k3s source code of version 1.28.5+k3s1. The config.yaml passed to k3s is:
cluster-init: true
disable:
- traefik
default-local-storage-path: /var/lib/rancher/storage
disable-helm-controller: true
https-listen-port: 6443
kube-apiserver-arg:
- 'audit-log-path=/var/lib/rancher/k3s/server/logs/audit.log'
- 'audit-policy-file=/var/lib/rancher/k3s/server/audit.yaml'
- 'audit-log-maxage=30'
- 'audit-log-maxbackup=10'
- 'audit-log-maxsize=100'
- 'request-timeout=300s'
- 'service-account-lookup=true'
- 'anonymous-auth=false'
kube-cloud-controller-manager-arg:
- 'webhook-secure-port=0'
kube-controller-manager-arg:
- 'terminated-pod-gc-threshold=12500'
- 'use-service-account-credentials=true'
kubelet-arg:
- 'container-log-max-files=4'
- 'container-log-max-size=50Mi'
- 'streaming-connection-idle-timeout=5m'
- 'make-iptables-util-chains=true'
node-external-ip: 10.54.65.139
node-name: vm-139.tests.lab.net
prefer-bundled-bin: true
protect-kernel-defaults: true
secrets-encryption: true
tls-san:
- 10.54.65.139
write-kubeconfig-mode: "0600" audit policy: apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
There is a custom path used for persistent storage by the local-path provisioner. Other storage paths are default in my setup. |
The node external IP is set so that the LoadBalancer service gets that IP, as per the docs: |
Where are the
You conveniently cut off the last half of that paragraph:
If there is no external IP, then the internal IP is used. The external IP overrides the internal IP for the purposes of setting the ServiceLB ingress address. If they are the same, you should not set it. Similarly, you do not need to add the node IP to the tls-san list. The node IP and hostname are always included in the SAN list.
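As an illustration only, a trimmed sketch of the server config with those two redundant entries dropped (abbreviated; the other settings from the config.yaml above are unchanged and omitted here for brevity):
# node-external-ip and tls-san removed: the node's own IP is already used for
# the ServiceLB ingress address and is already included in the certificate SANs.
cat > /etc/rancher/k3s/config.yaml <<'EOF'
cluster-init: true
disable:
  - traefik
node-name: vm-139.tests.lab.net
write-kubeconfig-mode: "0600"
# ...remaining kube-apiserver-arg / kubelet-arg entries as before...
EOF
|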
I was facing the very same odd behavior when the
The node itself was in absolutely normal shape:
status:
addresses:
- address: 192.168.0.200
type: InternalIP
- address: master
type: Hostname
but the coredns pod wasn't starting because of the missing entry in the CM; the k3s systemd unit had many failed attempts to read the CM, like this:
Apparently caused by https://github.com/k3s-io/k3s/blob/master/pkg/node/controller.go#L81. I run
I assume there's some kind of race condition: the controller is too fast to read this data from the CM and fails, then the coredns deployment kicks in and gets stuck trying to access the missing CM key referenced in the deployment here. Eventually only a normal systemd unit restart solved this issue, so that the controller was able to update the CM entry.
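In other words, the recovery that worked here was roughly this (assuming the default k3s systemd unit and kubeconfig):
# Restart the server so the node controller re-runs and repopulates the key.
sudo systemctl restart k3s
# After a short wait the key should be present again and the coredns pod can mount it.
kubectl -n kube-system get configmap coredns -o jsonpath='{.data.NodeHosts}'
|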
The controller's cache should get populated eventually; it should succeed after a few retries at worst. I'm curious what's preventing the configmap cache from getting filled. |
@avoidik I am running that same release of K3s with the same args and I am unable to reproduce:
metadata:
annotations:
k3s.io/node-args: '["server","--node-ip","172.17.0.7","--flannel-iface","eth0","--write-kubeconfig-mode","644","--kube-apiserver-arg","service-node-port-range=30000-30100","--secrets-encryption","--disable","traefik","--disable","servicelb","--disable-helm-controller","--disable-network-policy","--kubelet-arg","cloud-provider=external","--debug"]'
status:
nodeInfo:
architecture: amd64
bootID: 6751d5d3-977d-4c16-8d3f-9ae4cc0ec497
containerRuntimeVersion: containerd://1.7.11-k3s2
kernelVersion: 6.2.0-1014-aws
kubeProxyVersion: v1.28.5+k3s1
kubeletVersion: v1.28.5+k3s1
If you start k3s with
|
Is the coredns pod still in the ContainerCreating state? |
No, because the node host entry was added - see the last log line.
brandond@dev01:~$ kubectl get pod -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-6799fbcd5-rwxbb 1/1 Running 0 20m 10.42.0.3 k3s-server-1 <none> <none>
kube-system local-path-provisioner-84db5d44d9-8qpgx 1/1 Running 0 20m 10.42.0.2 k3s-server-1 <none> <none>
kube-system metrics-server-67c658944b-69lvq 1/1 Running 0 20m 10.42.0.4 k3s-server-1 <none> <none> |
@brandond I now see what's going on when the
and here is the pending coredns pod status (10 minutes was not enough to recover):
Here is a sample Vagrantfile which could help reproduce the problem in isolation:
ENV_DEFAULTS = {
INSTALL_K3S_SKIP_DOWNLOAD: "true",
INSTALL_K3S_VERSION: "v1.28.5+k3s1",
K3S_TOKEN: "43bf98d8fb25e7fd4275ae06f33adacd",
IFACE_NAME: "enp0s8",
}
Vagrant.configure(2) do |config|
config.vm.define 'master' do |master|
master.vm.box = 'ubuntu/focal64'
master.vm.hostname = 'master'
master.vm.synced_folder '.', '/vagrant', type: 'virtualbox'
master.vm.network 'private_network', ip: '192.168.0.200'
master.vm.provider 'virtualbox' do |v|
v.memory = 2048
v.cpus = 2
v.name = 'k3s-master-01'
v.customize ['modifyvm', :id, '--audio', 'none']
v.customize ["modifyvm", :id, "--ioapic", 'on']
end
master.vm.provision 'shell', env: ENV_DEFAULTS, inline: <<-'SHELL'
curl -fsSL https://get.k3s.io -o install.sh
curl -fsSL https://github.com/k3s-io/k3s/releases/download/$INSTALL_K3S_VERSION/k3s -o /usr/local/bin/k3s
chmod +x /usr/local/bin/k3s
mkdir -p /var/lib/rancher/k3s/agent/images
curl -fsSL https://github.com/k3s-io/k3s/releases/download/$INSTALL_K3S_VERSION/k3s-airgap-images-amd64.tar.zst -o /var/lib/rancher/k3s/agent/images/k3s-airgap-images.tar.zst
IPADDR="$(ip addr show ${IFACE_NAME} | grep 'inet ' | awk '{print $2;}' | cut -d'/' -f1)"
export INSTALL_K3S_EXEC="server --node-ip=${IPADDR} --flannel-iface=${IFACE_NAME} --write-kubeconfig-mode=644 --kube-apiserver-arg=service-node-port-range=30000-30100 --secrets-encryption --disable=traefik --disable=servicelb --disable-helm-controller --disable-network-policy --kubelet-arg cloud-provider=external --debug"
sh install.sh
if [ -f /etc/rancher/k3s/k3s.yaml ]; then
cp /etc/rancher/k3s/k3s.yaml /tmp/
sed -i "s/127.0.0.1/${IPADDR}/" /tmp/k3s.yaml
mkdir -p /home/vagrant/.kube
cp /tmp/k3s.yaml /home/vagrant/.kube/config
cp /tmp/k3s.yaml /vagrant/
rm -f /tmp/k3s.yaml
chown -R vagrant:vagrant /home/vagrant/.kube
fi
SHELL
end
end
# vagrant up
# export KUBECONFIG="$(pwd)/k3s.yaml"
# kubectl get nodes
# kubectl get pods -n kube-system
# vagrant ssh
Interesting observation: it works as expected if I comment out the part where the airgap images are downloaded; this supports the assumption I made earlier about a race condition.
|
Thanks for the steps! I can strip that down to the following shell commands and reproduce the issue:
export INSTALL_K3S_VERSION="v1.29.1-rc2+k3s2"
mkdir -p /var/lib/rancher/k3s/agent/images
curl -fsSL https://github.com/k3s-io/k3s/releases/download/${INSTALL_K3S_VERSION}/k3s-airgap-images-amd64.tar.zst -o /var/lib/rancher/k3s/agent/images/k3s-airgap-images.tar.zst
curl -fsSL get.k3s.io | sh -s - --node-ip=172.17.0.7 --flannel-iface=eth0 --write-kubeconfig-mode=644 --kube-apiserver-arg=service-node-port-range=30000-30100 --secrets-encryption --disable=traefik --disable=servicelb --disable-helm-controller --disable-network-policy --kubelet-arg cloud-provider=external --debug |
I think the key here is the startup being delayed a bit by the image import, combined with the helm controller being disabled. That results in the node controller running too late to trigger initialization of the configmap cache.
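A crude way to tell whether the controller eventually catches up or stays stuck (just polling the configmap; not part of any fix):
# Poll until the NodeHosts key shows up; if this never completes, the
# controller never initialized its configmap cache.
until kubectl -n kube-system get configmap coredns -o jsonpath='{.data.NodeHosts}' | grep -q .; do
  sleep 5
done
|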
Validated on Version:
$ k3s version
v1.29.1+k3s-de825845 (de825845)
Environment Details
Infrastructure
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
Steps to validate the fix
Reproduction Issue:
Validation Results:
|
@InCogNiTo124 please open a new issue describing your configuration using the issue template. It does not sound like the same problem. |
After a painful few months, I seem to have fixed it by disabling public IPv6 on a Hetzner instance. I'm finally successfully running something after |
When deploying the latest k3s version v1.29.0+k3s1, the coredns pod is stuck in the ContainerCreating stage because it cannot find the key NodeHosts in the coredns configmap. Looking at the manifest definitions, the problem appears to be real.
Searching the k3s documentation, there is no mention of how to specify those NodeHosts entries without manually editing the configmap after deployment.
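For reference, a manual edit of that kind would look something like the following (a stop-gap sketch only, with a single placeholder entry; the controller is supposed to manage this key itself):
# Hypothetical example entry; replace the IP/hostname with your node's values.
kubectl -n kube-system patch configmap coredns --type merge \
  -p '{"data":{"NodeHosts":"10.54.65.165 vm-165.tests.lab.net\n"}}'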
Error:
Warning FailedMount 85s (x19 over 23m) kubelet MountVolume.SetUp failed for volume "config-volume" : configmap references non-existent config key: NodeHosts
Manifests Link:
https://github.com/k3s-io/k3s/blob/master/manifests/coredns.yaml