[clusteragent/autoscaling] Use `PodWatcher` to update current replicas in status #28857

jennchenn · 2024-08-28T14:59:44Z

What does this PR do?

Use PodWatcher to update the current replica count in the DatadogPodAutoscaler status, instead of reading it from the horizontal /scale sub-resource.

Motivation

If horizontal scaling makes no changes or is not activated, the current replica count may be inaccurate. This change moves the logic that updates the current replica count into the controller loop so that it stays as up to date as possible.
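
A minimal sketch of the idea, not the code in this PR: on every controller run, ask the pod watcher for the target's pods and use the count as the current replicas. `NamespacedPodOwner` and `GetPodsForOwner` are named after the snippet quoted in the notes below; `Pod`, `fakePodWatcher`, and `currentReplicas` are hypothetical stand-ins introduced for illustration.

```go
package main

import "fmt"

// NamespacedPodOwner identifies the workload that owns a set of pods.
type NamespacedPodOwner struct {
	Namespace string
	Name      string
	Kind      string
}

// Pod is a hypothetical stand-in for the pod type the real watcher returns.
type Pod struct {
	Name string
}

// fakePodWatcher is a hypothetical in-memory watcher used only for this sketch.
type fakePodWatcher struct {
	pods map[NamespacedPodOwner][]Pod
}

// GetPodsForOwner returns the pods currently known for the given owner.
func (w *fakePodWatcher) GetPodsForOwner(owner NamespacedPodOwner) []Pod {
	return w.pods[owner]
}

// currentReplicas derives the replica count from the watcher, independently
// of whether horizontal scaling is active.
func currentReplicas(w *fakePodWatcher, owner NamespacedPodOwner) int32 {
	return int32(len(w.GetPodsForOwner(owner)))
}

func main() {
	owner := NamespacedPodOwner{Namespace: "default", Name: "web", Kind: "Deployment"}
	w := &fakePodWatcher{pods: map[NamespacedPodOwner][]Pod{
		owner: {{Name: "web-a"}, {Name: "web-b"}, {Name: "web-c"}},
	}}
	fmt.Println(currentReplicas(w, owner)) // prints 3
}
```

An owner the watcher has never seen yields a count of 0 here; how the real controller reports that case is not covered by this sketch.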

Additional Notes

Possible Drawbacks / Trade-offs

During testing, there was sometimes a lag between a scaling action and the corresponding update of PodWatcher information. This can lead to an inconsistent replica count being reported for up to 5 minutes (the value is corrected on the next run of the controller loop).

There is currently duplicated logic between the controller and vertical controller to get the number of pods - would it make sense to combine these two to avoid calling podWatcher.GetPodsForOwner twice? The now duplicated logic:

datadog-agent/pkg/clusteragent/autoscaling/workload/controller_vertical.go

Lines 70 to 84 in 2f3053a

    targetGVK, err := autoscalerInternal.TargetGVK()
    if err != nil {
    	autoscalerInternal.SetError(err)
    	return autoscaling.NoRequeue, err
    }

    // Get the pod owner from the workload
    target := NamespacedPodOwner{
    	Namespace: autoscalerInternal.Namespace(),
    	Name:      autoscalerInternal.Spec().TargetRef.Name,
    	Kind:      targetGVK.Kind,
    }

    // Get the pods for the pod owner
    pods := u.podWatcher.GetPodsForOwner(target)
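
One possible shape for the deduplication, sketched under assumptions rather than taken from the merged code: the main sync builds the target once and hands it to the vertical step, so the target-resolution logic lives in a single place. `buildTarget`, `syncVertical`, `syncAutoscaler`, `Pod`, and `fakePodWatcher` are hypothetical names.

```go
package main

import "fmt"

// NamespacedPodOwner identifies the workload that owns a set of pods.
type NamespacedPodOwner struct {
	Namespace string
	Name      string
	Kind      string
}

// Pod is a hypothetical stand-in for the pod type the real watcher returns.
type Pod struct {
	Name string
}

// fakePodWatcher is a hypothetical in-memory watcher used only for this sketch.
type fakePodWatcher struct {
	pods map[NamespacedPodOwner][]Pod
}

func (w *fakePodWatcher) GetPodsForOwner(owner NamespacedPodOwner) []Pod {
	return w.pods[owner]
}

// buildTarget is the single place where the pod owner is derived.
func buildTarget(namespace, name, kind string) NamespacedPodOwner {
	return NamespacedPodOwner{Namespace: namespace, Name: name, Kind: kind}
}

// syncVertical receives the already-resolved target instead of rebuilding it.
// It returns the number of pods it would consider, as a placeholder for real work.
func syncVertical(w *fakePodWatcher, target NamespacedPodOwner) int {
	return len(w.GetPodsForOwner(target))
}

// syncAutoscaler resolves the target once, updates the replica count from it,
// and passes the same target on to the vertical step.
func syncAutoscaler(w *fakePodWatcher, namespace, name, kind string) (int32, int) {
	target := buildTarget(namespace, name, kind)
	current := int32(len(w.GetPodsForOwner(target)))
	return current, syncVertical(w, target)
}

func main() {
	owner := buildTarget("default", "web", "Deployment")
	w := &fakePodWatcher{pods: map[NamespacedPodOwner][]Pod{
		owner: {{Name: "web-a"}, {Name: "web-b"}},
	}}
	current, vertical := syncAutoscaler(w, "default", "web", "Deployment")
	fmt.Println(current, vertical) // prints 2 2
}
```

This variant still queries the watcher twice; passing the resolved pod slice instead of the target would also remove the second lookup, at the cost of a wider function signature.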

Describe how to test/QA your changes

Run an autoscaling workload
Verify that the Current Replicas count in the status is updated after scaling actions (even if horizontal scaling is disabled)

pr-commenter · 2024-08-28T15:12:36Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=43652514 --os-family=ubuntu

Note: This applies to commit 734b934

agent-platform-auto-pr · 2024-08-28T15:17:57Z

[Fast Unit Tests Report]

On pipeline 43652514 (CI Visibility). The following jobs did not run any unit tests:

Jobs:

tests_flavor_dogstatsd_deb-x64
tests_flavor_heroku_deb-x64
tests_flavor_iot_deb-x64

If you modified Go files and expected unit tests to run in these jobs, please double check the job logs. If you think tests should have been executed reach out to #agent-devx-help

pr-commenter · 2024-08-28T15:39:17Z

Regression Detector

Regression Detector Results

Run ID: ba351497-4dbb-466e-9361-1ff7da428ecf Metrics dashboard Target profiles

Baseline: 6b60c2c
Comparison: 734b934

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	tcp_syslog_to_blackhole	ingress throughput	+7.30	[-5.85, +20.46]	1	Logs
➖	file_tree	memory utilization	+1.23	[+1.15, +1.31]	1	Logs
➖	basic_py_check	% cpu utilization	+0.74	[-1.98, +3.46]	1	Logs
➖	pycheck_lots_of_tags	% cpu utilization	+0.70	[-1.87, +3.27]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.01, +0.01]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.00	[-0.00, +0.00]	1	Logs
➖	idle	memory utilization	-0.13	[-0.17, -0.10]	1	Logs
➖	otel_to_otel_logs	ingress throughput	-0.48	[-1.28, +0.33]	1	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	-0.73	[-1.51, +0.05]	1	Logs

Bounds Checks

perf	experiment	bounds_check_name	replicates_passed
✅	idle	memory_usage	10/10

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
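
The three criteria above can be written as a small predicate. This is an illustrative sketch, not the detector's implementation; the `Experiment` type and `isRegression` function are hypothetical, and the sample values are taken from the table in this report.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment holds the fields needed by the decision rule (hypothetical type).
type Experiment struct {
	Name         string
	DeltaMeanPct float64 // estimated Δ mean %
	CILow        float64 // lower bound of the 90% CI
	CIHigh       float64 // upper bound of the 90% CI
	Erratic      bool    // true if the configuration marks the experiment "erratic"
}

// effectSizeTolerance is the |Δ mean %| threshold stated in the report.
const effectSizeTolerance = 5.00

// isRegression applies the three criteria: large enough effect,
// confidence interval excluding zero, and not marked erratic.
func isRegression(e Experiment) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= effectSizeTolerance
	ciExcludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && ciExcludesZero && !e.Erratic
}

func main() {
	rows := []Experiment{
		{Name: "tcp_syslog_to_blackhole", DeltaMeanPct: 7.30, CILow: -5.85, CIHigh: 20.46},
		{Name: "file_tree", DeltaMeanPct: 1.23, CILow: 1.15, CIHigh: 1.31},
	}
	for _, r := range rows {
		fmt.Printf("%s: %t\n", r.Name, isRegression(r))
	}
}
```

Both rows evaluate to false: the first exceeds the effect-size threshold but its interval contains zero, while the second has an interval that excludes zero but an effect below 5%.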

vboulineau

Yes we should dedupe the code and give the necessary info to vertical controller.

vboulineau · 2024-08-29T10:54:57Z

pkg/clusteragent/autoscaling/workload/controller.go

+	targetGVK, targetErr := podAutoscalerInternal.TargetGVK()
+	if targetErr != nil {
+		podAutoscalerInternal.SetError(targetErr)
+		return autoscaling.NoRequeue, targetErr


In that case you'd miss the status update.

Added a check to update the status and return the error at the end. Error priority: error during scaling > error getting the target > error updating the status.

Actually, if we get a targetErr, it's not worth going further; everything is going to fail.

Makes sense, I can wait for the changes from #28723 (quoted below) to be merged so I can make a status update and return early once targetErr is encountered.

datadog-agent/pkg/clusteragent/autoscaling/workload/controller.go

Lines 413 to 427 in f35a065

    func (c *Controller) updateAutoscalerStatusAndUnlock(ctx context.Context, key, ns, name string, err error, podAutoscalerInternal model.PodAutoscalerInternal, podAutoscaler *datadoghq.DatadogPodAutoscaler) error {
    	// Update status based on latest state
    	statusErr := c.updatePodAutoscalerStatus(ctx, podAutoscalerInternal, podAutoscaler)
    	if statusErr != nil {
    		log.Errorf("Failed to update status for PodAutoscaler: %s/%s, err: %v", ns, name, statusErr)

    		// We want to return the status error if none to count in the requeue retries.
    		if err == nil {
    			err = statusErr
    		}
    	}

    	c.store.UnlockSet(key, podAutoscalerInternal, c.ID)
    	return err
    }

vboulineau · 2024-08-30T15:44:26Z

pkg/clusteragent/autoscaling/workload/controller.go

@@ -272,16 +273,37 @@ func (c *Controller) syncPodAutoscaler(ctx context.Context, key, ns, name string
 	// Reaching this point, we had an error in processing, clearing up global error
 	podAutoscalerInternal.SetError(nil)

+	targetGVK, targetErr := podAutoscalerInternal.TargetGVK()
+	if targetErr != nil {
+		log.Errorf("Failed to get target GVK for PodAutoscaler: %s/%s, err: %v", ns, name, targetErr)


This is a functional error and will be reflected in the status. We should not increase the number of error logs in the DCA for functional errors (usually a mistake in the name).

jennchenn · 2024-09-24T18:32:49Z

/merge

dd-devflow · 2024-09-24T18:32:55Z

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 23m.

Use /merge -c to cancel this operation!

jennchenn added 3 commits August 28, 2024 10:14

Update current replica count using podwatcher

5e04831

Remove update to current replicas in horizontal controller tests

99abb19

fixup! Update current replica count using podwatcher

fdad6fb

jennchenn added team/containers changelog/no-changelog labels Aug 28, 2024

jennchenn requested a review from a team as a code owner August 28, 2024 14:59

vboulineau reviewed Aug 29, 2024

Pass target to vertical controller to dedupe

ae0a5a4

vboulineau reviewed Aug 30, 2024

jennchenn added 3 commits August 30, 2024 12:48

Remove error log due to user error with setting DPA target

2e012f0

Merge remote-tracking branch 'origin/main' into jenn/CASCL-57_update-…

6dc63b4

…current-replicas-with-podwatcher

fixup! Merge remote-tracking branch 'origin/main' into jenn/CASCL-57_…

734b934

…update-current-replicas-with-podwatcher

sblumenthal approved these changes Sep 5, 2024

dd-mergequeue bot merged commit 82b135e into main Sep 24, 2024
219 checks passed

dd-mergequeue bot deleted the jenn/CASCL-57_update-current-replicas-with-podwatcher branch September 24, 2024 18:52

github-actions bot added this to the 7.59.0 milestone Sep 24, 2024

grantseltzer pushed a commit that referenced this pull request Oct 2, 2024

[clusteragent/autoscaling] Use PodWatcher to update current replica…

0c366ab

…s in status (#28857)

grantseltzer pushed a commit that referenced this pull request Oct 4, 2024

[clusteragent/autoscaling] Use PodWatcher to update current replica…

bc0d73f

…s in status (#28857)

jennchenn added the qa/done QA done before merge and regressions are covered by tests label Oct 4, 2024

jennchenn added the component/autoscaling label Nov 5, 2024
