🩺 kube-doctor

Is your Kubernetes cluster unhealthy? Do your workloads have symptoms? Then maybe it needs a checkup with kube-doctor 🏥

$ kube-doctor --warning-symptoms --non-namespaced-resources
== Checking DaemonSet resources
👀 DaemonSet kube-system/efs-csi-node: efs-plugin no resources specified
== Checking Deployment resources
👀 Deployment opencost/opencost: container 'opencost' memory request and limit are not equal
👀 Deployment default/application-one-listener: 5/8 pods are not ready
== Checking Endpoint resources
❌ Endpoint default/application-two: no ready addresses in subsets
== Checking Event resources
❌ Event datadog/datadog-x62q2: (Pod) 43.4 minutes ago: network is not ready: container runtime network not ready: NetworkReady=fals...
❌ Event default/application-one-597f47458c-fdb4r: (Pod) 1.6 minutes ago: Back-off restarting failed container
❌ Event datadog/datadog-95q6n: (Pod) 18.6 minutes ago: deleting pod for node scale down
❌ Event ip-10-10-10-10.compute.internal: (Node) 9.5 minutes ago: marked the node as toBeDeleted/unschedulable
❌ Event kube-system/cluster-autoscaler-status: (ConfigMap) 26.6 minutes ago: Scale-down: node ip-10-10-10-20.compute.internal removed with drain
== Checking HorizontalPodAutoscaler resources
👀 HorizontalPodAutoscaler default/application-three: has condition ScalingActive=False and reason ScalingDisabled
👀 HorizontalPodAutoscaler default/application-four: has condition ScalingLimited=True and reason TooFewReplicas
== Checking Job resources
❌ Job production/train-model: BackoffLimitExceeded: Job has reached the specified backoff limit
❌ Job production/run-analysis: DeadlineExceeded: Job was active longer than specified deadline
== Checking PersistentVolume resources
❌ PersistentVolume pgwatch-storage-pv-database: older than 5 minutes and status is not bound
== Checking Pod resources
❌ Pod default/application-two-uje-h2bhq: not running
❌ Pod datadog/datadog-555h5: status condition Ready is False
❌ Pod default/application-six: container 'app' was restarted 3.1 mins ago: 1 (exit code) Error (reason)
👀 Pod default/application-two-lhu-4r7hn: container 'app' has been restarted 5 times
== Checking Service resources
🎉 No symptoms found
== Checking PersistentVolume resources
⭕️ No resources found
== Checking KubeApiHealthEndpointStatus resources
🎉 No symptoms found
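
If you want to post-process a report like the one above, for example in a scheduled job, the line markers are enough to get a rough tally. The sketch below is not part of kube-doctor. It assumes that lines starting with ❌ are errors and lines starting with 👀 are warnings, which is inferred from the sample output and the --warning-symptoms flag rather than documented. The countSymptoms function is a hypothetical helper.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countSymptoms tallies report lines by their leading marker.
// Assumption: "❌" marks an error and "👀" marks a warning, as in the
// sample output. Header lines ("== Checking ...") and all other lines
// are ignored.
func countSymptoms(output string) (errCount, warnCount int) {
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		switch {
		case strings.HasPrefix(line, "❌"):
			errCount++
		case strings.HasPrefix(line, "👀"):
			warnCount++
		}
	}
	return errCount, warnCount
}

func main() {
	// A few lines copied from the sample report.
	sample := strings.Join([]string{
		"== Checking Endpoint resources",
		"❌ Endpoint default/application-two: no ready addresses in subsets",
		"== Checking Pod resources",
		"❌ Pod default/application-two-uje-h2bhq: not running",
		"👀 Pod default/application-two-lhu-4r7hn: container 'app' has been restarted 5 times",
		"== Checking Service resources",
		"🎉 No symptoms found",
	}, "\n")

	errCount, warnCount := countSymptoms(sample)
	fmt.Printf("errors=%d warnings=%d\n", errCount, warnCount)
}
```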

This tool will check for the following symptoms:

  • Container:
    • no resources specified
    • no memory resources specified
    • no memory limit
    • memory request and limit are not equal
  • DaemonSet:
    • pods are not ready
    • pods are misscheduled
    • rolling update in progress
  • Deployment:
    • minimum availability not met
    • ReplicaSet update in progress
    • ReplicaSet update in progress but no progress
  • Endpoint:
    • no ready addresses in subsets
  • Event:
    • cluster-autoscaler events that are either:
      • not Type=Normal
      • ScaleUp or ScaleDown events
    • service-controller events that are not Type=Normal
    • default-scheduler events that are not Type=Normal and not Reason=FailedScheduling
    • kubelet events that are not Type=Normal and not Reason=Unhealthy
  • HorizontalPodAutoscaler:
    • various bad status conditions
  • Job:
    • failed within the last hour
  • Kubernetes API health endpoints:
    • any bad or unknown checks
  • Node:
    • mixed kubelet versions
    • not Ready
    • older than 5 minutes and not Ready
    • any bad status conditions
  • PersistentVolumeClaim & PersistentVolume:
    • older than 5 minutes and not in Bound phase
  • Pod:
    • phase that is not Running
    • any bad pod status conditions
    • various bad container status conditions
    • containers that crashed in the last hour
    • without an owner (e.g. created with kubectl run)
  • Service:
    • LoadBalancer type with bad status

Running

By default, kube-doctor checks all namespaces, but it can also target a specific namespace:

kube-doctor --namespace kube-system

Or a label selector:

kube-doctor --label-selector app.kubernetes.io/name=prometheus

Or a combination of both:

kube-doctor --label-selector app.kubernetes.io/name=prometheus --namespace monitoring
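
For readers new to label selectors: an equality-based selector such as app.kubernetes.io/name=prometheus matches only resources that carry exactly that label key and value. The sketch below illustrates those semantics for comma-separated key=value requirements. It is an illustration only. How kube-doctor evaluates the selector is not stated in this README (selectors are normally evaluated by the Kubernetes API server), the matchesSelector function is hypothetical, and the ==, != and set-based operators are deliberately not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesSelector reports whether labels satisfy every key=value requirement
// in a comma-separated selector. Only the "key=value" form is understood in
// this sketch; other operators are not handled correctly. An empty selector
// matches everything.
func matchesSelector(selector string, labels map[string]string) (bool, error) {
	if strings.TrimSpace(selector) == "" {
		return true, nil
	}
	for _, req := range strings.Split(selector, ",") {
		key, value, ok := strings.Cut(strings.TrimSpace(req), "=")
		if !ok || key == "" {
			return false, fmt.Errorf("unsupported requirement %q", req)
		}
		// The label must be present and have exactly the requested value.
		if got, present := labels[key]; !present || got != value {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	labels := map[string]string{
		"app.kubernetes.io/name": "prometheus",
		"team":                   "observability",
	}
	for _, sel := range []string{
		"app.kubernetes.io/name=prometheus",
		"app.kubernetes.io/name=prometheus,team=platform",
	} {
		ok, err := matchesSelector(sel, labels)
		fmt.Printf("%s -> match=%v err=%v\n", sel, ok, err)
	}
}
```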

Non-namespaced resources like nodes can be checked with the --non-namespaced-resources flag:

kube-doctor --non-namespaced-resources

To see other options, including debug logging, consult the help:

kube-doctor --help

Installation

Check out code and build:

git clone git@github.com:max-rocket-internet/kube-doctor.git
cd kube-doctor
go build ./... && go install ./...

Run from main branch without git:

go install github.com/max-rocket-internet/kube-doctor@latest
cd $GOPATH/pkg/mod/github.com/max-rocket-internet/kube-doctor*/
go run main.go

To download a binary, check the releases.

Contributing

Pull requests welcome 💙

To run all tests:

go test ./...

Or just a single package:

go test ./.../checkup