
kubeadm: make kubeadm init and join output the same error #130040

Merged
merged 1 commit into from
Feb 11, 2025

Conversation

HirazawaUi
Contributor

@HirazawaUi HirazawaUi commented Feb 7, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Currently, during kubeadm init, if waiting for the control plane components to start fails, we output a templated error message.

However, during kubeadm join, if waiting for the control plane components to start fails, we output a raw error message.

This PR aims to ensure that kubeadm init and kubeadm join output the same error message format when waiting for the control plane components to start fails.

Which issue(s) this PR fixes:

Fixes kubernetes/kubeadm#3149

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kubeadm: improved `kubeadm init` and `kubeadm join` to provide consistent error messages when the kubelet failed or when failed to wait for control plane components.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 7, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 7, 2025
@HirazawaUi
Contributor Author

After this PR, kubeadm init will throw the following error:

[kubelet-check] The kubelet is healthy after 502.472315ms

Unfortunately, an error has occurred:
	error

This error is likely caused by:
	- The kubelet is not running
	- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
	- 'systemctl status kubelet'
	- 'journalctl -xeu kubelet'


	Additionally, a control plane component may have crashed or exited when started by the container runtime.
	To troubleshoot, list all containers using your preferred container runtimes CLI.
	Here is one example how you may list all running Kubernetes containers by using crictl:
		- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
		Once you have found the failing container, you can inspect its logs with:
		- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'
error execution phase wait-control-plane: could not initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher

After this PR, kubeadm join will throw the following error:

[kubelet-check] The kubelet is healthy after 501.697408ms

Unfortunately, an error has occurred:
	error

This error is likely caused by:
	- The kubelet is not running
	- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
	- 'systemctl status kubelet'
	- 'journalctl -xeu kubelet'
error execution phase kubelet-wait-bootstrap: could not join the node to the Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher

@HirazawaUi
Contributor Author

/cc @pacoxu @neolit123

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Feb 7, 2025
@HirazawaUi HirazawaUi force-pushed the make-error-consistent branch 2 times, most recently from 78afe9a to bd36af9 Compare February 7, 2025 14:30
Member

@neolit123 neolit123 left a comment

thanks for the PR @HirazawaUi
here is how i think this old code should be refactored.

Review threads (outdated, resolved): cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go (2), cmd/kubeadm/app/phases/kubelet/kubelet.go (5)
@neolit123
Member

please always prefix notes with `kubeadm:` then a small letter.
/release-note-edit

kubeadm: improved `kubeadm init` and `kubeadm join` to provide consistent error messages when the kubelet failed or when failed to wait for control plane components.

@HirazawaUi HirazawaUi force-pushed the make-error-consistent branch from bd36af9 to dcb0d13 Compare February 7, 2025 14:48
@HirazawaUi
Contributor Author

All suggestions fixed.

Review threads (outdated, resolved): cmd/kubeadm/app/phases/kubelet/kubelet.go (4), cmd/kubeadm/app/cmd/phases/join/waitcontrolplane.go (1), cmd/kubeadm/app/cmd/phases/join/kubelet.go (2), cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go (2)
@neolit123
Member

neolit123 commented Feb 7, 2025

LGTM, mostly. should have a second reviewer too.

could you show one error output from e.g. kubeadm init just to check the spaces and empty lines?

@HirazawaUi HirazawaUi force-pushed the make-error-consistent branch 2 times, most recently from 63be289 to 14aa3ce Compare February 7, 2025 15:25
@neolit123
Member

ok, it would be nice to have another new line here:

	Once you have found the failing container, you can inspect its logs with:
	- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'

error execution phase wait-control-plane: failed while waiting for the kubelet to start: error
To see the stack trace of this error execute with --v=5 or higher

@neolit123
Member

error execution phase wait-control-plane: failed while waiting for the kubelet to start: error

this doesn't seem correct, any idea why there is a trailing `: error`?

@HirazawaUi
Contributor Author

HirazawaUi commented Feb 7, 2025

this doesn't seem correct, any idea why there is a trailing `: error`?

This error string is an error I injected. :)

ok, it would be nice to have another new line here:

Yes, I will debug and modify it further tomorrow.

Member

@pacoxu pacoxu left a comment

Overall LGTM. Only one nit.

```diff
 waiter.SetTimeout(data.Cfg().Timeouts.KubeletHealthCheck.Duration)
 kubeletConfig := data.Cfg().ClusterConfiguration.ComponentConfigs[componentconfigs.KubeletGroup].Get()
 kubeletConfigTyped, ok := kubeletConfig.(*kubeletconfig.KubeletConfiguration)
 if !ok {
 	return errors.New("could not convert the KubeletConfiguration to a typed object")
 }
 if err := waiter.WaitForKubelet(kubeletConfigTyped.HealthzBindAddress, *kubeletConfigTyped.HealthzPort); err != nil {
-	return handleError(err)
+	kubelet.PrintKubeletErrorHelpScreen(data.OutputWriter(), data.Cfg().NodeRegistration.CRISocket, true)
+	return errors.Wrap(err, "failed while waiting for the kubelet to start")
```
Member

The wrapped message uses kubelet, but the printer is controlPlane=true.

Contributor Author

Good catch!

Initially, I just wanted to use the controlPlane parameter to determine whether to display the full content of kubeletFailTempl as before. However, we can make it more granular:

  • If kubelet fails to start, we should only display the kubelet startup failure error and guidance.
  • If the failure occurs while waiting for WaitForControlPlaneComponents, we should display both the kubelet startup failure information and container troubleshooting logs.


```go
fmt.Fprintln(outputWriter, kubeletFailMsg)
if controlPlane {
	_ = controlPlaneFailTempl.Execute(outputWriter, context)
```
Member

We didn't handle the error (probably print it) before. Not sure if we should do that.

Contributor Author

In other parts of kubeadm, calls to the Execute method similarly do not return errors, so I chose to ignore the error here.

@HirazawaUi HirazawaUi force-pushed the make-error-consistent branch from 14aa3ce to ae78ec1 Compare February 8, 2025 08:48
@HirazawaUi
Contributor Author

kubeadm init:

[kubelet-check] The kubelet is healthy after 501.458126ms

Unfortunately, an error has occurred, likely caused by:
	- The kubelet is not running
	- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
	- 'systemctl status kubelet'
	- 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
	- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
	Once you have found the failing container, you can inspect its logs with:
	- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'

error execution phase wait-control-plane: failed while waiting for the kubelet to start: error message
To see the stack trace of this error execute with --v=5 or higher

kubeadm join:

[kubelet-check] The kubelet is healthy after 501.000436ms

Unfortunately, an error has occurred, likely caused by:
	- The kubelet is not running
	- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
	- 'systemctl status kubelet'
	- 'journalctl -xeu kubelet'

error execution phase kubelet-wait-bootstrap: failed while waiting for the kubelet to start: error message
To see the stack trace of this error execute with --v=5 or higher

@HirazawaUi HirazawaUi force-pushed the make-error-consistent branch from ae78ec1 to 8095cc3 Compare February 8, 2025 08:55
@pacoxu
Member

pacoxu commented Feb 8, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: cf082b5632cfc470ce786396ae36ae647c3b6796

```diff
@@ -45,6 +68,21 @@ func TryStartKubelet() {
 	}
 }
 
+// PrintKubeletErrorHelpScreen prints help text on kubelet errors.
+func PrintKubeletErrorHelpScreen(outputWriter io.Writer, criSocket string, waitControlPlaneComponents bool) {
```
Member

after some thinking, let's split this func into two:

  • PrintKubeletErrorHelpScreen
  • PrintControlPlaneErrorHelpScreen

this is nicer than using the same function and having a flag waitControlPlaneComponents.
also, when a control plane component fails we have already run the kubelet check earlier, which passed. so the problem at that point seems like failing CP components, not a failing kubelet.

Contributor Author

@HirazawaUi HirazawaUi Feb 10, 2025

If possible, I would prefer to split this function only after the WaitForAllControlPlaneComponents feature gate reaches the GA stage. If users manually disable this feature gate, the error reporting could become confusing as we would lose troubleshooting hints for checking controlPlane components failures.

Alternatively, we could implement a more laborious approach now:

  • When the WaitForAllControlPlaneComponents feature gate is enabled, we would output separate troubleshooting hints for kubelet and controlPlane components respectively.
  • If it's disabled, we would maintain the current behavior.

Member

@neolit123 neolit123 Feb 10, 2025

WaitForControlPlaneComponents is basically the new WaitForAPI, which used to wait for only the apiserver. the help text to debug a control plane pod is relevant for both cases.

i think the two errors are distinctive. we shouldn't show an error that the kubelet has failed if the control plane (or just the apiserver) could not start. that's a separate problem.

to summarize, i think we should;

  • show a kubelet help text on WaitForKubeletErrors
  • show a control plane help text if WaitForAPI or WaitForControlPlaneComponents failed.

if the user hits the kubelet error they must first resolve that problem. it's blocking the workflow to continue further, then later they might also hit the CP error.

Contributor Author

Fair.

```go
	- 'journalctl -xeu kubelet'`)

controlPlaneFailTempl = template.Must(template.New("init").Parse(dedent.Dedent(`
	Additionally, a control plane component may have crashed or exited when started by the container runtime.
```
Member

Suggested change:
```diff
-	Additionally, a control plane component may have crashed or exited when started by the container runtime.
+	A control plane component may have crashed or exited when started by the container runtime.
```

```go
	- 'journalctl -xeu kubelet'`)

controlPlaneFailTempl = template.Must(template.New("init").Parse(dedent.Dedent(`
	Additionally, a control plane component may have crashed or exited when started by the container runtime.
```
Member

@neolit123 neolit123 Feb 10, 2025

leaving the text like that even if the WaitForControlPlaneComponents FG is disabled still makes sense.
the old error will actually say that it couldn't probe kube-apiserver at healthz for n minutes.

@HirazawaUi HirazawaUi force-pushed the make-error-consistent branch from 8095cc3 to f99456e Compare February 11, 2025 10:16
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2025
@k8s-ci-robot k8s-ci-robot requested a review from pacoxu February 11, 2025 10:16
@HirazawaUi HirazawaUi force-pushed the make-error-consistent branch from f99456e to 8f78cdf Compare February 11, 2025 10:32
```diff
@@ -52,6 +54,17 @@ const (
 	argAdvertiseAddress = "advertise-address"
 )
 
+var (
+	controlPlaneFailTempl = template.Must(template.New("init").Parse(dedent.Dedent(`
```
Contributor Author

I moved controlPlaneFailTempl here. After the split, putting it in the kubelet's file seems a bit unreasonable.

Member

please also move KubeletFailMsg here and make it a private var.
add an exported function PrintKubeletErrorHelpScreen that just calls fmt.Fprintln(data.OutputWriter(), kubeletFailMsg); consistency would be nicer that way.

Contributor Author

fixed.

@HirazawaUi HirazawaUi force-pushed the make-error-consistent branch from 8f78cdf to f66211b Compare February 11, 2025 10:58
@HirazawaUi HirazawaUi force-pushed the make-error-consistent branch from f66211b to ab02cda Compare February 11, 2025 13:21
@HirazawaUi
Contributor Author

I used the latest code and manually injected errors. kubeadm output the following error log:

kubeadm join:

[kubelet-check] The kubelet is healthy after 500.96268ms

Unfortunately, an error has occurred, likely caused by:
	- The kubelet is not running
	- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
	- 'systemctl status kubelet'
	- 'journalctl -xeu kubelet'

error execution phase kubelet-wait-bootstrap: failed while waiting for the kubelet to start: error message
To see the stack trace of this error execute with --v=5 or higher

CP components failed during kubeadm init:

[api-check] The API server is healthy after 2.501536321s

A control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
	- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
	Once you have found the failing container, you can inspect its logs with:
	- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'

error execution phase wait-control-plane: failed while waiting for the control plane to start: error message
To see the stack trace of this error execute with --v=5 or higher

kubelet startup failed during kubeadm init:

[kubelet-check] The kubelet is healthy after 502.490188ms

Unfortunately, an error has occurred, likely caused by:
	- The kubelet is not running
	- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
	- 'systemctl status kubelet'
	- 'journalctl -xeu kubelet'

error execution phase wait-control-plane: failed while waiting for the kubelet to start: error message
To see the stack trace of this error execute with --v=5 or higher

Member

@neolit123 neolit123 left a comment

/lgtm
/approve

thanks

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 8c9433873aa0473f3334c458c13ae61b87f9f873

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HirazawaUi, neolit123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 11, 2025
@k8s-ci-robot k8s-ci-robot merged commit e30c8a3 into kubernetes:master Feb 11, 2025
13 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 11, 2025