
Split tutorial for CAPI, add capi troubleshooting pages
berkayoz committed Jan 31, 2025
1 parent 12f659e commit 341b7c0
Showing 5 changed files with 364 additions and 80 deletions.
2 changes: 2 additions & 0 deletions docs/src/capi/howto/index.md
@@ -14,13 +14,15 @@ Overview <self>
:glob:
:titlesonly:
Provision a Canonical Kubernetes cluster <provision>
Install custom Canonical Kubernetes <custom-ck8s>
Use external etcd <external-etcd.md>
Upgrade the Kubernetes version <rollout-upgrades>
Perform an in-place upgrade <in-place-upgrades>
Upgrade the providers of a management cluster <upgrade-providers>
Migrate the management cluster <migrate-management>
Refresh workload cluster certificates <refresh-certs>
Troubleshooting <troubleshooting>
```

---
100 changes: 100 additions & 0 deletions docs/src/capi/howto/provision.md
@@ -0,0 +1,100 @@
# Provisioning a {{product}} cluster with CAPI

This guide covers how to deploy a {{product}} multi-node cluster
using Cluster API (CAPI).

## Prerequisites

This guide assumes the following:

- A CAPI management cluster initialised with the infrastructure, bootstrap and
  control plane providers of your choice. Please refer to the
  [getting-started guide] for instructions (an illustrative example follows).
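
For illustration only, initialising a management cluster could look like the
following. This is a hedged sketch: it assumes AWS as the infrastructure
provider and that the {{product}} bootstrap and control plane providers are
available to `clusterctl` under the name `canonical-kubernetes`; adjust the
names to your environment and treat the [getting-started guide] as the
authoritative reference.

```
clusterctl init --infrastructure aws \
  --bootstrap canonical-kubernetes \
  --control-plane canonical-kubernetes
```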

## Generate a cluster spec manifest

You can generate a cluster manifest for a selected set of commonly used
infrastructures via templates provided by the {{product}} team.
Ensure you have initialised the desired infrastructure provider, then fetch
the {{product}} provider repository:

```
git clone https://github.com/canonical/cluster-api-k8s
```

Review the list of variables needed for the cluster template:

```
cd cluster-api-k8s
export CLUSTER_NAME=yourk8scluster
clusterctl generate cluster ${CLUSTER_NAME} --from ./templates/<infrastructure-provider>/cluster-template.yaml --list-variables
```

Set the respective environment variables by editing the rc file as needed
before sourcing it. Then generate the cluster manifest:

```
source ./templates/<infrastructure-provider>/template-variables.rc
clusterctl generate cluster ${CLUSTER_NAME} --from ./templates/<infrastructure-provider>/cluster-template.yaml > cluster.yaml
```

Each provisioned node is associated with a `CK8sConfig`, through which you can
set the cluster’s properties. Available configuration fields can be listed in detail with:

```
sudo k8s kubectl explain CK8sConfig.spec
```

Review the available options and edit the generated cluster manifest
(`cluster.yaml` above) to match your needs.
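
For example, you can locate the `CK8sConfig`-related sections of the generated
manifest before editing them. This is a minimal sketch using standard shell
tools; adjust the amount of context to taste:

```
# Show every CK8sConfig occurrence with some surrounding context
grep -n -B 2 -A 10 "CK8sConfig" cluster.yaml
```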

## Deploy the cluster

To deploy the cluster, run:

```
sudo k8s kubectl apply -f cluster.yaml
```

For an overview of the cluster status, run:

```
clusterctl describe cluster ${CLUSTER_NAME}
```

To get the list of provisioned clusters:

```
sudo k8s kubectl get clusters
```

To see the deployed machines:

```
sudo k8s kubectl get machine
```

After the first control plane node is provisioned, you can get the kubeconfig
of the workload cluster:

```
clusterctl get kubeconfig ${CLUSTER_NAME} > ./${CLUSTER_NAME}-kubeconfig
```

You can then see the workload nodes using:

```
sudo k8s kubectl --kubeconfig ./${CLUSTER_NAME}-kubeconfig get node
```

## Delete the cluster

To delete a cluster, run:

```
sudo k8s kubectl delete cluster ${CLUSTER_NAME}
```

<!-- LINKS -->

[getting-started guide]: ../tutorial/getting-started
252 changes: 252 additions & 0 deletions docs/src/capi/howto/troubleshooting.md
@@ -0,0 +1,252 @@
# How to troubleshoot {{product}}

Identifying issues in a Kubernetes cluster can be difficult, especially for
new users. With {{product}} we aim to make deploying and managing your cluster
as easy as possible. This how-to guide walks you through the steps to
troubleshoot your {{product}} cluster.

## Check the cluster status

Verify that the cluster status is ready by running:

```
sudo k8s kubectl get cluster,ck8scontrolplane,machinedeployment,machine
```

You should see a command output similar to the following:

```
NAME                                  CLUSTERCLASS   PHASE         AGE   VERSION
cluster.cluster.x-k8s.io/my-cluster                  Provisioned   16m

NAME                                                                       INITIALIZED   API SERVER AVAILABLE   VERSION   REPLICAS   READY   UPDATED   UNAVAILABLE
ck8scontrolplane.controlplane.cluster.x-k8s.io/my-cluster-control-plane    true          true                   v1.32.1   1          1       1

NAME                                                         CLUSTER      REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE     AGE   VERSION
machinedeployment.cluster.x-k8s.io/my-cluster-worker-md-0    my-cluster   1          1       1         0             Running   16m   v1.32.1

NAME                                                          CLUSTER      NODENAME                                           PROVIDERID      PHASE     AGE   VERSION
machine.cluster.x-k8s.io/my-cluster-control-plane-j7w6m       my-cluster   my-cluster-cp-my-cluster-control-plane-j7w6m       <provider-id>   Running   16m   v1.32.1
machine.cluster.x-k8s.io/my-cluster-worker-md-0-8zlzv-7vff7   my-cluster   my-cluster-wn-my-cluster-worker-md-0-8zlzv-7vff7   <provider-id>   Running   80s   v1.32.1
```

## Check the providers' status

Provisioning failures in a {{product}} cluster can originate from any of the
providers involved in CAPI.

Check the {{product}} bootstrap provider logs:

```
k8s kubectl logs -n cabpck-system deployment/cabpck-bootstrap-controller-manager
```

Examine the {{product}} control-plane provider logs:

```
k8s kubectl logs -n cacpck-system deployment/cacpck-controller-manager
```

Review the CAPI controller logs:

```
k8s kubectl logs -n capi-system deployment/capi-controller-manager
```

Check the logs for the infrastructure provider by running:

```
k8s kubectl logs -n <infrastructure-provider-namespace> <infrastructure-provider-deployment>
```

## Test the API server health

Fetch the kubeconfig file for a {{product}} cluster provisioned through CAPI by running:

```
clusterctl get kubeconfig ${CLUSTER_NAME} > ./${CLUSTER_NAME}-kubeconfig.yaml
```

Verify that the API server is healthy and reachable by running:

```
kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml get all
```

This command lists resources that exist under the default namespace. If the API
server is healthy you should see a command output similar to the following:

```
NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.152.183.1   <none>        443/TCP   29m
```

A typical error message may look like this if the API server cannot be reached:

```
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
```

Such a failure can mean that (a quick connectivity check is sketched after
this list):

* The API server is not reachable due to network issues or firewall limitations
* The API server on the particular node is unhealthy
* All control plane nodes are down
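
To quickly tell a connectivity problem apart from an unhealthy API server, you
can probe the control plane endpoint directly. This is a sketch only;
substitute the address and port of your workload cluster's control plane:

```
# An HTTP response (even 401 or 403) proves the endpoint is reachable;
# connection refused or a timeout points at network issues or a down API server
curl -k https://<control-plane-endpoint>:6443/healthz
```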

## Check the cluster nodes' health

Confirm that the nodes in the cluster are healthy by looking for the `Ready`
status:

```
kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml get nodes
```

You should see a command output similar to the following:

```
NAME                                               STATUS   ROLES                  AGE     VERSION
my-cluster-cp-my-cluster-control-plane-j7w6m       Ready    control-plane,worker   17m     v1.32.1
my-cluster-wn-my-cluster-worker-md-0-8zlzv-7vff7   Ready    worker                 2m14s   v1.32.1
```

## Troubleshoot an unhealthy node

Every healthy {{product}} node has certain services up and running. The
required services depend on the type of node.

Services running on both the control plane and worker nodes:

* `k8sd`
* `kubelet`
* `containerd`
* `kube-proxy`

Services running only on the control plane nodes:

* `kube-apiserver`
* `kube-controller-manager`
* `kube-scheduler`
* `k8s-dqlite`

Services running only on the worker nodes:

* `k8s-apiserver-proxy`

Depending on your infrastructure provider, make the necessary adjustments for
SSH access and SSH into the unhealthy node:

```
ssh <user>@<node>
```

Check the status of the services on the failing node by running:

```
sudo systemctl status snap.k8s.<service>
```
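
To check all of the expected services in one go, you can loop over them. This
is a convenience sketch for a control plane node; adjust the list of services
for a worker node:

```
for service in k8sd kubelet containerd kube-proxy \
    kube-apiserver kube-controller-manager kube-scheduler k8s-dqlite; do
  echo "${service}: $(sudo systemctl is-active snap.k8s.${service})"
done
```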

Check the logs of a failing service by executing:

```
sudo journalctl -xe -u snap.k8s.<service>
```

If the issue indicates a problem with the configuration of the services on the
node, examine the arguments used to run them.

The arguments of a service on the failing node are listed in the file
`/var/snap/k8s/common/args/<service>`.
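
For example, to review the arguments of the kubelet on a node and restart it
after a change (the kubelet is used purely as an illustration; the same
applies to any of the services listed above):

```
# Show the arguments the kubelet is started with
sudo cat /var/snap/k8s/common/args/kubelet

# After editing an argument file, restart the corresponding service
sudo systemctl restart snap.k8s.kubelet
```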

## Investigate system pods' health

Check whether all of the cluster's pods are `Running` and `Ready`:

```
kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml get pods -n kube-system
```

The pods in the `kube-system` namespace belong to {{product}} features such as
`network`. Unhealthy pods could be related to configuration issues or nodes not
meeting certain requirements.
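
Recent events in the `kube-system` namespace often point at the root cause,
for example failed scheduling or image pulls. To list them sorted by creation
time:

```
kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml get events -n kube-system \
  --sort-by=.metadata.creationTimestamp
```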

## Troubleshoot a failing pod

Look at the events on a failing pod by running:

```
kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml describe pod <pod-name> -n <namespace>
```

Check the logs on a failing pod by executing:

```
kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml logs <pod-name> -n <namespace>
```

You can check out the upstream [debug pods documentation][] for more
information.

## Use the built-in inspection script

{{product}} ships with a script to compile a complete report on {{product}} and
its underlying system. This is an essential tool for bug reports and for
investigating whether a system is (or isn’t) working.

The inspection script can be executed on a specific node by running the following
commands:

```
ssh -t <user>@<node> -- sudo k8s inspect /home/<user>/inspection-report.tar.gz
scp <user>@<node>:/home/<user>/inspection-report.tar.gz ./
```

The command output is similar to the following:

```
Collecting service information
Running inspection on a control-plane node
INFO: Service k8s.containerd is running
INFO: Service k8s.kube-proxy is running
INFO: Service k8s.k8s-dqlite is running
INFO: Service k8s.k8sd is running
INFO: Service k8s.kube-apiserver is running
INFO: Service k8s.kube-controller-manager is running
INFO: Service k8s.kube-scheduler is running
INFO: Service k8s.kubelet is running
Collecting registry mirror logs
Collecting service arguments
INFO: Copy service args to the final report tarball
Collecting k8s cluster-info
INFO: Copy k8s cluster-info dump to the final report tarball
Collecting SBOM
INFO: Copy SBOM to the final report tarball
Collecting system information
INFO: Copy uname to the final report tarball
INFO: Copy snap diagnostics to the final report tarball
INFO: Copy k8s diagnostics to the final report tarball
Collecting networking information
INFO: Copy network diagnostics to the final report tarball
Building the report tarball
SUCCESS: Report tarball is at /home/ubuntu/inspection-report.tar.gz
```

Use the report to verify that all the necessary services are running and to
examine every aspect of the system.
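
The report is a regular tarball, so you can extract it locally and browse the
collected logs, service arguments and diagnostics. A minimal sketch:

```
mkdir -p inspection-report
tar -xzf inspection-report.tar.gz -C inspection-report
ls -R inspection-report
```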

## Report a bug

If you cannot solve your issue and believe that the fault may lie in
{{product}}, please [file an issue on the project repository][].

Help us deal effectively with issues by including the report obtained from the
inspection script, any additional logs, and a summary of the issue.

You can check out the upstream [debug documentation][] for more details on
troubleshooting a Kubernetes cluster.

<!-- Links -->

[file an issue on the project repository]: https://github.com/canonical/cluster-api-k8s/issues/new/choose
[debug pods documentation]: https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods
[debug documentation]: https://kubernetes.io/docs/tasks/debug
1 change: 1 addition & 0 deletions docs/src/capi/reference/index.md
@@ -15,6 +15,7 @@ annotations
Ports and services <ports-and-services>
Community <community>
configs
troubleshooting
```
