Unable to install RKE2 on Amazon Linux 2023 #4527

Closed
zackbradys opened this issue Jul 31, 2023 · 17 comments

Comments

@zackbradys
Contributor

zackbradys commented Jul 31, 2023

Environmental Info:
RKE2 Version: v1.25.12+rke2r1

Node(s) CPU architecture, OS, and Version: Amazon Linux 2023 (AL2023) with ami-0f34c5ae932e6f0e4

[root@ip-172-31-32-223 rke2-artifacts]# cat /etc/os-release 
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
SUPPORT_END="2028-03-01"

Cluster Configuration: Single Node (testing purposes)

Describe the bug: Unable to download, install, or activate RKE2 on Amazon Linux 2023 (AL2023).

Steps To Reproduce:

[root@ip-172-31-39-26 ec2-user]#  curl -sfL https://get.rke2.io | sh
[INFO]  finding release for channel stable
[INFO]  using 1.25 series from channel stable
install: invalid option -- 'y'
Try 'install --help' for more information.

OR

[root@ip-172-31-32-223 rke2-artifacts]# curl -sfL https://get.rke2.io | INSTALL_RKE2_CHANNEL=v1.25 INSTALL_RKE2_TYPE=server sh -
[INFO]  using stable RPM repositories
[INFO]  using 1.25 series from channel stable
install: invalid option -- 'y'
Try 'install --help' for more information.

After editing line 477 of the install.sh script to include [ -r /etc/amazon-linux-release ] ||, or after creating a file at /etc/centos-release, RKE2 successfully downloads and installs the necessary requirements, but it falls back to v1.25.4+rke2r1 and the el7 packages, whereas I would expect AL2023 to be more similar to el8.
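
The workaround above amounts to making a release-file check also accept Amazon Linux. A minimal sketch of that kind of check is shown here; it is not the actual contents of install.sh, and the function name is_rpm_distro and its root-prefix argument are hypothetical, added only so the logic can be exercised against a scratch directory:

```shell
#!/bin/sh
# Hypothetical sketch, NOT the real install.sh logic.
# Returns 0 if a recognised release file is readable under the
# optional root prefix given as $1 (defaults to the real root).
is_rpm_distro() {
    root="${1:-}"
    [ -r "${root}/etc/centos-release" ] ||
    [ -r "${root}/etc/amazon-linux-release" ]
}

if is_rpm_distro; then
    echo "release file found"
else
    echo "no recognised release file"
fi
```

Only the two file names mentioned in this report are checked; a real installer would likely test more release files than these.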

Installed:
  container-selinux-2:2.189.0-289.amzn2023.0.2.noarch            iptables-libs-1.8.8-3.amzn2023.0.2.x86_64            iptables-nft-1.8.8-3.amzn2023.0.2.x86_64           
  libnetfilter_conntrack-1.0.8-2.amzn2023.0.2.x86_64             libnfnetlink-1.0.1-19.amzn2023.0.2.x86_64            libnftnl-1.2.2-2.amzn2023.0.2.x86_64               
  rke2-common-1.25.4~rke2r1-0.el7.x86_64                         rke2-selinux-0.8-2.el7.noarch                        rke2-server-1.25.4~rke2r1-0.el7.x86_64             
Skipped:
  rke2-common-1.25.12~rke2r1-0.el7.x86_64          rke2-selinux-0.12-1.el7.noarch          rke2-selinux-0.13-1.el7.noarch          rke2-selinux-0.14-1.el7.noarch         
  rke2-server-1.25.12~rke2r1-0.el7.x86_64    

After this change, upon activating RKE2 with systemctl start rke2-server, it fails and does not produce any useful troubleshooting information.

Expected behavior: Download, Install, and Activate RKE2 on Amazon Linux 2023 (AL2023).

Actual behavior: RKE2 fails with errors when downloading, installing, and activating on Amazon Linux 2023 (AL2023).

Additional context / logs:

  • journalctl -xefu rke2-server does not produce any useful information (500 Internal Server Error).
  • /var/lib/rancher/rke2/agent/logs/kubelet.log does not appear to have any useful information.
  • /var/lib/rancher/rke2/agent/containerd/containerd.log does not appear to have any useful information.
  • /var/lib/rancher/rke2/bin/crictl ps does not appear to have any useful information.
@brandond
Member

brandond commented Aug 1, 2023

We don't technically support Amazon Linux at this time. Ref: https://docs.rke2.io/install/requirements#linux

@brandond brandond added this to the Backlog milestone Aug 1, 2023
@zackbradys
Contributor Author

zackbradys commented Aug 1, 2023

Hey @brandond, definitely understand and didn't expect it to work without any troubleshooting or fixes! It may be referenced internally with a little more discussion due to some of our government customers.

@zackbradys
Contributor Author

Forgot to tag you earlier... @dweomer

@brandond
Member

brandond commented Aug 1, 2023

For the various log files that "do not appear to have any useful information", can you attach them anyway? Along with whatever is in /var/log/pods

@zackbradys
Contributor Author

zackbradys commented Aug 1, 2023

For sure... I didn't want to overcrowd the GH Issue. I'll attach them now.

@zackbradys
Contributor Author

journalctl -xefu rke2-server

Aug 01 03:10:45 ip-172-31-42-40.ec2.internal rke2[26348]: {"level":"warn","ts":"2023-08-01T03:10:45.436Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000f0c540/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Aug 01 03:10:45 ip-172-31-42-40.ec2.internal rke2[26348]: time="2023-08-01T03:10:45Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Aug 01 03:10:46 ip-172-31-42-40.ec2.internal rke2[26348]: time="2023-08-01T03:10:46Z" level=info msg="Container for etcd not found (no matching container found), retrying"
Aug 01 03:10:46 ip-172-31-42-40.ec2.internal rke2[26348]: time="2023-08-01T03:10:46Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
Aug 01 03:10:50 ip-172-31-42-40.ec2.internal rke2[26348]: time="2023-08-01T03:10:50Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Aug 01 03:10:51 ip-172-31-42-40.ec2.internal rke2[26348]: time="2023-08-01T03:10:51Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
Aug 01 03:10:55 ip-172-31-42-40.ec2.internal rke2[26348]: time="2023-08-01T03:10:55Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Aug 01 03:10:56 ip-172-31-42-40.ec2.internal rke2[26348]: time="2023-08-01T03:10:56Z" level=info msg="Waiting for etcd server to become available"
Aug 01 03:10:56 ip-172-31-42-40.ec2.internal rke2[26348]: time="2023-08-01T03:10:56Z" level=info msg="Waiting for API server to become available"
Aug 01 03:10:56 ip-172-31-42-40.ec2.internal rke2[26348]: time="2023-08-01T03:10:56Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
Aug 01 03:11:00 ip-172-31-42-40.ec2.internal rke2[26348]: time="2023-08-01T03:11:00Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"

/var/lib/rancher/rke2/agent/logs/kubelet.log

E0801 03:12:13.723095   26379 event.go:276] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"ip-172-31-42-40.ec2.internal.1777230d62cd7dd6", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-172-31-42-40.ec2.internal", UID:"ip-172-31-42-40.ec2.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeHasSufficientPID", Message:"Node ip-172-31-42-40.ec2.internal status is now: NodeHasSufficientPID", Source:v1.EventSource{Component:"kubelet", Host:"ip-172-31-42-40.ec2.internal"}, FirstTimestamp:time.Date(2023, time.August, 1, 2, 58, 45, 500091862, time.Local), LastTimestamp:time.Date(2023, time.August, 1, 2, 58, 45, 579890048, time.Local), Count:2, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Patch "https://127.0.0.1:6443/api/v1/namespaces/default/events/ip-172-31-42-40.ec2.internal.1777230d62cd7dd6": dial tcp 127.0.0.1:6443: connect: connection refused'(may retry after sleeping)
E0801 03:12:13.758179   26379 kubelet.go:2448] "Error getting node" err="node \"ip-172-31-42-40.ec2.internal\" not found"

/var/lib/rancher/rke2/agent/containerd/containerd.log

time="2023-08-01T03:13:11.996045748Z" level=info msg="cleaning up dead shim"
time="2023-08-01T03:13:12.011302269Z" level=warning msg="cleanup warnings time=\"2023-08-01T03:13:12Z\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=30134 runtime=io.containerd.runc.v2\ntime=\"2023-08-01T03:13:12Z\" level=warning msg=\"failed to read init pid file\" error=\"open /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/ff527d466f55f3f472de15bd68b23fdc4db058e9eaa4a1190c96d95576acc2cc/init.pid: no such file or directory\" runtime=io.containerd.runc.v2\n"
time="2023-08-01T03:13:12.011502689Z" level=error msg="copy shim log" error="read /proc/self/fd/20: file already closed"
time="2023-08-01T03:13:12.015109397Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:etcd-ip-172-31-42-40.ec2.internal,Uid:e18aa5e5b83a5a3c56d78e4054612394,Namespace:kube-system,Attempt:0,} failed, error" error="failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: write /proc/self/attr/keycreate: invalid argument: unknown"
time="2023-08-01T03:13:22.700218436Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:etcd-ip-172-31-42-40.ec2.internal,Uid:e18aa5e5b83a5a3c56d78e4054612394,Namespace:kube-system,Attempt:0,}"
time="2023-08-01T03:13:22.726881131Z" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
time="2023-08-01T03:13:22.726961198Z" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
time="2023-08-01T03:13:22.726976863Z" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
time="2023-08-01T03:13:22.727130660Z" level=info msg="starting signal loop" namespace=k8s.io path=/run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/4248b813ddeb9cad229cb501b5e5dddc68efa178880387b81f08715a769d3706 pid=30154 runtime=io.containerd.runc.v2
time="2023-08-01T03:13:23.105979383Z" level=info msg="shim disconnected" id=4248b813ddeb9cad229cb501b5e5dddc68efa178880387b81f08715a769d3706
time="2023-08-01T03:13:23.106031080Z" level=warning msg="cleaning up after shim disconnected" id=4248b813ddeb9cad229cb501b5e5dddc68efa178880387b81f08715a769d3706 namespace=k8s.io

/var/lib/rancher/rke2/bin/crictl ps

CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD

ls /var/log/pods

kube-system_etcd-ip-172-31-42-40.ec2.internal_e18aa5e5b83a5a3c56d78e4054612394

@brandond
Member

brandond commented Aug 1, 2023

Can you attach (not paste inline) the full contents, not just the last few lines? Same for the pod logs - just knowing that there is a log file doesn't help me much.
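
One way to gather the full logs as a single attachment is sketched below. It assumes the default RKE2 paths already mentioned in this thread; the function name collect_rke2_logs and its optional root-prefix argument are hypothetical and exist only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: bundle RKE2 logs into one tarball for attaching to an issue.
# $1 = output tarball path, $2 = optional root prefix (default: real root).
collect_rke2_logs() {
    out="$1"
    root="${2:-}"
    stage="$(mktemp -d)" || return 1

    # Full service journal, if journalctl is available on this host.
    if command -v journalctl >/dev/null 2>&1; then
        journalctl --no-pager -u rke2-server > "${stage}/journalctl.txt" 2>/dev/null || true
    fi

    # Copy whatever exists; missing files are skipped rather than treated as errors.
    for f in \
        /var/lib/rancher/rke2/agent/logs/kubelet.log \
        /var/lib/rancher/rke2/agent/containerd/containerd.log
    do
        if [ -r "${root}${f}" ]; then
            cp "${root}${f}" "${stage}/"
        fi
    done
    if [ -d "${root}/var/log/pods" ]; then
        cp -R "${root}/var/log/pods" "${stage}/pods"
    fi

    # Pack the staging directory, then clean it up.
    tar -czf "${out}" -C "${stage}" .
    status=$?
    rm -rf "${stage}"
    return "${status}"
}

# Example (run as root on the affected node):
#   collect_rke2_logs /tmp/rke2-logs.tar.gz
```

The resulting tarball holds complete files rather than the last few lines, which is what is being asked for here.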

@zackbradys
Contributor Author

I definitely missed the attach part of your previous comment. No logs in /var/log/pods, only the directory previously listed. Let me know what else may be helpful!

Logs: kubelet.log | containerd.log | journalctl.txt

@brandond
Member

brandond commented Aug 1, 2023

From your containerd log file:

time="2023-08-01T03:56:53.209518491Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:etcd-ip-172-31-42-42.ec2.internal,Uid:e18aa5e5b83a5a3c56d78e4054612394,Namespace:kube-system,Attempt:0,} failed, error" error="failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: write /proc/self/attr/keycreate: invalid argument: unknown"

A little searching brings up #1539 (comment) and the following comments, which suggest that the rke2-selinux package you chose to install is not compatible with the version of container-selinux that Amazon Linux is providing.

Did you see and ignore any errors from the postinstall script when installing the EL7 rke2-selinux package? Have you tried the EL8 or EL9 RPMs from https://github.com/rancher/rke2-selinux/releases/tag/v0.14.stable.1 ?
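
A quick way to check whether a node is hitting the same failure is to search the containerd log for the keycreate message and then list the installed policy packages. The sketch below assumes the default RKE2 log path shown earlier in this thread; the helper name has_keycreate_error is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given containerd log contains
# the keycreate failure quoted above.
has_keycreate_error() {
    grep -q 'write /proc/self/attr/keycreate: invalid argument' "$1"
}

log=/var/lib/rancher/rke2/agent/containerd/containerd.log
if [ -r "$log" ] && has_keycreate_error "$log"; then
    echo "keycreate failure found; installed SELinux policy packages:"
    # rpm -q prints the installed version of each package, or says it is not installed.
    rpm -q container-selinux rke2-selinux || true
fi
```

If the message is present, comparing the two package versions against the requirements of the rke2-selinux RPM in use is a reasonable next step.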

@zackbradys
Contributor Author

zackbradys commented Aug 1, 2023

I let the install script handle it; I didn't install anything manually. I don't remember seeing any errors from install.sh, but I have definitely seen issues in the past with container-selinux-2.189.0-289. I'll do a fresh install, double-check the output, and do some testing with different versions.

@zackbradys
Contributor Author

zackbradys commented Aug 1, 2023

Additionally, upgrading container-selinux to v2.205.0 should work if the install.sh script is modified with [ -r /etc/amazon-linux-release ].

Errors when running curl -sfL https://get.rke2.io | sh without any modification or workarounds:

[root@ip-172-31-39-26 yum.repos.d]# curl -sfL https://get.rke2.io | sh
[INFO]  finding release for channel stable
[INFO]  using 1.25 series from channel stable
Rancher RKE2 Common Latest                                                                                                                   5.3 kB/s | 2.6 kB     00:00    
Rancher RKE2 1.18 Latest                                                                                                                      12 kB/s | 6.0 kB     00:00    
Rancher RKE2 Common (stable)                                                                                                                  13 kB/s | 2.9 kB     00:00    
Rancher RKE2 1.25 (stable)                                                                                                                    13 kB/s | 2.9 kB     00:00    
Error: 
 Problem: package rke2-server-1.25.12~rke2r1-0.el7.x86_64 requires rke2-common = 1.25.12~rke2r1-0.el7, but none of the providers can be installed
  - package rke2-common-1.25.12~rke2r1-0.el7.x86_64 requires rke2-selinux >= 0.12-0, but none of the providers can be installed
  - conflicting requests
  - nothing provides container-selinux < 2:2.164.2 needed by rke2-selinux-0.12-1.el7.noarch
  - nothing provides container-selinux < 2:2.164.2 needed by rke2-selinux-0.13-1.el7.noarch
  - nothing provides container-selinux < 2:2.164.2 needed by rke2-selinux-0.14-1.el7.noarch
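
The dependency error above comes down to a version bound: the el7 rke2-selinux packages require container-selinux below 2.164.2, while AL2023 ships 2.189.0. A rough sketch of that comparison follows. It uses GNU sort -V as an approximation of RPM version ordering and ignores the epoch prefix (2:), which is the same on both sides here; the function name version_lt is hypothetical:

```shell
#!/bin/sh
# Returns 0 if version $1 is strictly lower than version $2,
# using GNU sort -V ordering (an approximation, not rpm's own comparison).
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

required_below="2.164.2"   # upper bound from the error output above
installed="2.189.0"        # container-selinux version on AL2023 in this report

if version_lt "$installed" "$required_below"; then
    echo "constraint satisfied"
else
    echo "constraint not satisfied: ${installed} is not < ${required_below}"
fi
```

With the values from this report the check fails, which matches the "nothing provides" lines in the output.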

@brandond
Member

brandond commented Aug 1, 2023

That sounds about right. Glad to know it works when you install the correct selinux package.

I don't believe we have any plans to support Amazon Linux, but we can leave this issue open so the next person who tries can find your steps.

@zackbradys
Contributor Author

Appreciate the help working through it! It's nice to have a workaround for it.

@zackbradys
Contributor Author

zackbradys commented Aug 1, 2023

AL2023 Package Request: amazonlinux/amazon-linux-2023#409

@stewartsmith

Greetings from Amazon Linux land: it's important to note that AL2023 is not a CentOS clone, and does not claim any level of compatibility with any particular version of CentOS, thus using el7, el8, or el9 RPMs is going to be an uphill battle for anything non-trivial.

If you're looking to build packages for AL2023, you can do so the standard way (mock configurations are in upstream mock), and you can even use COPR to build for AL2023.

Happy to chat as to what the build requirements could be to enable builds of AL2023 packages.

@zackbradys
Contributor Author

zackbradys commented Aug 17, 2023

Hey @stewartsmith, I saw your comment on the other issue (amazonlinux/amazon-linux-2023#409) and replied to it a few minutes ago. Apologies for not seeing this comment!

I definitely understand that AL2023 is not a clone of, or directly compatible with, any specific distro, but I believe there is an industry standard derived from the most widely adopted Linux distros that should be taken into account. In the long term, I hope Rancher decides to officially support the Amazon Linux distros and build packages to support that effort, but a small lift of ensuring packages stay updated could be a fix for us all. I know many of my customers and other users run Rancher RKE2 on AWS and would love to be using AL2023 as the Linux distro!

@fmoral2
Contributor

fmoral2 commented Dec 13, 2023

Validated on Version:

-$  rke2 version v1.28.4+dev.c1494f5d (c1494f5de1d2f6ae26cbb7d8ec365344dc1209d8)

Environment Details

Infrastructure
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"

Cluster Configuration:
1 server node

Steps to validate the fix

  1. Install rke2
  2. Apply config file if needed
  3. Enable and start rke2 service
  4. Check if everything is running

Validation Results:

sudo mkdir -p /etc/rancher/rke2

sudo bash -c 'cat <<EOF>/etc/rancher/rke2/config.yaml
EOF'

curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_COMMIT=c1494f5de1d2f6ae26cbb7d8ec365344dc1209d8 sh - 


sudo systemctl enable rke2-server --now



rke2 -v
rke2 version v1.28.4+dev.c1494f5d (c1494f5de1d2f6ae26cbb7d8ec365344dc1209d8)
go version go1.20.11 X:boringcrypto



k get nodes,pods -A
NAME                                              STATUS   ROLES                       AGE   VERSION
node/i .us-east-2.compute.internal   Ready    control-plane,etcd,master   42s   v1.28.4+rke2r1

NAMESPACE     NAME                                                                      READY   STATUS              RESTARTS   AGE
kube-system   pod/cloud-controller-manager-ip- .us-east-2.compute.internal   1/1     Running             0          31s
kube-system   pod/etcd-ip- .us-east-2.compute.internal                       1/1     Running             0          16s
kube-system   pod/helm-install-rke2-canal-sv877                                         0/1     Completed           0          25s
kube-system   pod/helm-install-rke2-coredns-59bqs                                       0/1     Completed           0          25s
kube-system   pod/helm-install-rke2-ingress-nginx-bcgl9                                 0/1     ContainerCreating   0          25s
kube-system   pod/helm-install-rke2-metrics-server-g5np5                                0/1     ContainerCreating   0          25s
kube-system   pod/helm-install-rke2-snapshot-controller-crd-cwl6r                       0/1     ContainerCreating   0          25s
kube-system   pod/helm-install-rke2-snapshot-controller-z6dr4                           0/1     ContainerCreating   0          25s
kube-system   pod/helm-install-rke2-snapshot-validation-webhook-jpctx                   0/1     ContainerCreating   0          25s
kube-system   pod/kube-apiserver-ip .us-east-2.compute.internal             1/1     Running             0          34s
kube-system   pod/kube-controller-manager-ip .us-east-2.compute.internal    1/1     Running             0          30s
kube-system   pod/kube-proxy-ip- .us-east-2.compute.internal                 1/1     Running             0          34s
kube-system   pod/kube-scheduler-ip-  .us-east-2.compute.internal             1/1     Running             0          30s
kube-system   pod/rke2-canal-vmblz                                                      0/2     Init:0/2            0          13s
kube-system   pod/rke2-coredns-rke2-coredns-6b795db654-jkzds                            0/1     ContainerCreating   0          14s
kube-system   pod/rke2-coredns-rke2-coredns-autoscaler-945fbd459-6dgrc                  0/1     ContainerCreating   0          14s


 

 

@fmoral2 fmoral2 closed this as completed Dec 13, 2023