feat(probe): add startup probes #1234

Merged (1 commit) on Jul 9, 2024

@wdhif wdhif commented Jun 13, 2024

What does this PR do?

Adds Kubernetes startup probe support for:

  • Agent
  • Cluster Agent
  • Cluster Check Runner

Motivation

This is needed to add support for the new Kubernetes Startup probe for the Agent components.

Additional Notes

Note that the /startup endpoint currently returns the same result as the default /health endpoint. Support for the dedicated /startup endpoint was added in Agent 7.55, but no components register a startup health check yet.

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

Describe your test plan

  • Deploy the Operator
➜  datadog-dev git:(main) ✗ k get pod
NAME                                             READY   STATUS    RESTARTS   AGE
datadog-agent-8mmmp                              5/5     Running   0          4m10s
datadog-cluster-agent-6f58f64977-pjvhs           1/1     Running   0          4m11s
datadog-cluster-checks-runner-7f6b8b574d-rhtdk   1/1     Running   0          4m11s
datadog-operator-manager-584b8d9498-sx8bd        1/1     Running   0          4m32s
  • Check the startup probes.
    • For the Agent
➜  datadog-dev git:(main) ✗ k describe pod datadog-agent-8mmmp | grep Startup
    Startup:        http-get http://:5555/startup delay=15s timeout=5s period=15s #success=1 #failure=6
    • For the Cluster Agent
➜  datadog-dev git:(main) ✗ k describe pod datadog-cluster-agent-6f58f64977-pjvhs| grep Startup
    Startup:        http-get http://:5555/startup delay=15s timeout=5s period=15s #success=1 #failure=6
    • For the Cluster Check Runner
➜  datadog-dev git:(main) ✗ k describe pod datadog-cluster-checks-runner-7f6b8b574d-rhtdk| grep Startup
    Startup:        http-get http://:5555/startup delay=15s timeout=5s period=15s #success=1 #failure=6

We can also make sure that the endpoints are working as intended.

    • For the Agent
➜  datadog-dev git:(main) ✗ k exec -it daemonsets/datadog-agent -- curl http://localhost:5555/startup
Defaulted container "agent" out of: agent, trace-agent, security-agent, system-probe, process-agent, init-volume (init), init-config (init), seccomp-setup (init)
{"Healthy":["healthcheck","ad-config-provider-kubernetes-container-allinone","tagger-store","ad-kubeletlistener","collector-queue-15s","collector-queue-900s","ad-servicelistening","tagger-workloadmeta","workloadmeta-store","logs-agent","collector-queue-20s","ad-config-provider-endpoints-checks","aggregator","workloadmeta-puller","workloadmeta-docker","dogstatsd-main"],"Unhealthy":null}%
    • For the Cluster Agent
➜  datadog-dev git:(main) ✗ k exec -it deployments/datadog-cluster-agent -- curl http://localhost:5555/startup
{"Healthy":["healthcheck","ad-servicelistening","tagger-workloadmeta","clusterchecks-leadership","clusterchecks-dispatch","aggregator","workloadmeta-puller","tagger-store","collector-queue-15s","ad-config-provider-kubernetes-services","ad-config-provider-kubernetes-endpoints","workloadmeta-store"],"Unhealthy":null}%
    • For the Cluster Check Runner
➜  datadog-dev git:(main) ✗ k exec -it deployments/datadog-cluster-checks-runner -- curl http://localhost:5555/startup
Defaulted container "agent" out of: agent, init-config (init)
{"Healthy":["healthcheck","collector-queue-10s","ad-servicelistening","aggregator","workloadmeta-store","workloadmeta-puller","ad-config-provider-cluster-checks","collector-queue-15s"],"Unhealthy":null}%
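
The same check can be scripted. The Go sketch below decodes a response body shaped like the `/startup` output above and reports whether any component is listed as unhealthy. The type `healthStatus` and the helpers `parseHealth` and `ok` are hypothetical names for illustration, not the Agent's own types; the field names are taken from the JSON shown above. The trailing `%` in the transcripts is assumed to be the shell's marker for output without a final newline and is not part of the body.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthStatus mirrors the JSON shape seen in the /startup responses above.
// Hypothetical type for illustration only.
type healthStatus struct {
	Healthy   []string `json:"Healthy"`
	Unhealthy []string `json:"Unhealthy"`
}

// parseHealth decodes a response body into a healthStatus.
func parseHealth(body []byte) (healthStatus, error) {
	var s healthStatus
	err := json.Unmarshal(body, &s)
	return s, err
}

// ok reports whether no component is listed as unhealthy.
// A JSON null decodes to a nil slice, which has length zero.
func (s healthStatus) ok() bool {
	return len(s.Unhealthy) == 0
}

func main() {
	// Shortened sample body; the real responses list more components.
	body := []byte(`{"Healthy":["healthcheck","aggregator","workloadmeta-store"],"Unhealthy":null}`)
	s, err := parseHealth(body)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("healthy components: %d, all ok: %v\n", len(s.Healthy), s.ok())
}
```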

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label

@wdhif wdhif added the enhancement (New feature or request) label Jun 13, 2024
@wdhif wdhif added this to the v1.7.0 milestone Jun 13, 2024

codecov-commenter commented Jun 13, 2024

Codecov Report

Attention: Patch coverage is 5.55556% with 17 lines in your changes missing coverage. Please review.

Project coverage is 54.84%. Comparing base (eb49fb8) to head (0e03d0a).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1234      +/-   ##
==========================================
- Coverage   54.87%   54.84%   -0.04%     
==========================================
  Files         241      241              
  Lines       27910    27928      +18     
==========================================
+ Hits        15317    15318       +1     
- Misses      11722    11739      +17     
  Partials      871      871              
Flag        Coverage Δ
unittests   54.84% <5.55%> (-0.04%) ⬇️

Files                                                   Coverage Δ
...ers/datadogagent/component/clusteragent/default.go   61.03% <100.00%> (+0.16%) ⬆️
...ontrollers/datadogagent/component/agent/default.go   0.00% <0.00%> (ø)
...adogagent/component/clusterchecksrunner/default.go   0.68% <0.00%> (-0.01%) ⬇️
apis/datadoghq/common/common.go                         14.92% <0.00%> (-4.31%) ⬇️


@celenechang celenechang modified the milestones: v1.7.0, v1.8.0 Jun 17, 2024
@wdhif wdhif force-pushed the CONTP-247/wassim.dhif/startup-probe branch from 14b27f9 to 9bd9078 Compare June 19, 2024 12:01
@wdhif wdhif force-pushed the CONTP-247/wassim.dhif/startup-probe branch 3 times, most recently from 16e8136 to d802c04 Compare June 28, 2024 10:42
@wdhif wdhif marked this pull request as ready for review June 28, 2024 11:25
@wdhif wdhif requested review from a team as code owners June 28, 2024 11:25
apis/datadoghq/common/const.go (review thread resolved)
@wdhif wdhif force-pushed the CONTP-247/wassim.dhif/startup-probe branch from f7bf0bb to 0e03d0a Compare July 9, 2024 08:51

wdhif commented Jul 9, 2024

/merge


dd-devflow bot commented Jul 9, 2024

🚂 MergeQueue: waiting for PR to be ready

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.

Use /merge -c to cancel this operation!

@wdhif wdhif merged commit b6bd845 into main Jul 9, 2024
22 checks passed
@wdhif wdhif deleted the CONTP-247/wassim.dhif/startup-probe branch July 9, 2024 09:46

dd-devflow bot commented Jul 9, 2024

🚂 MergeQueue: This merge request was already merged

This pull request was merged directly.

mftoure pushed a commit that referenced this pull request Oct 3, 2024