Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: push tsdump upload failed metrics to datadog logs #137250

Merged

Conversation

aa-joshi
Copy link
Contributor

Previously, we were displaying failed tsdump upload metrics on CLI output. This was inadequate because CLI output might get truncated in case of high number of metric failure. To address this, this patch ships failed metrics to datadog as part of logs so that we can see all failed uploaded metrics as part of logs.

Part of: CRDB-44835
Epic: None
Release note: None

Log snippet:
Screenshot 2024-12-11 at 1 53 10 PM

@aa-joshi aa-joshi requested review from a team as code owners December 11, 2024 10:19
@aa-joshi aa-joshi requested review from angles-n-daemons and arjunmahishi and removed request for a team December 11, 2024 10:19
Copy link

blathers-crl bot commented Dec 11, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@arjunmahishi arjunmahishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi and @angles-n-daemons)


-- commits line 15 at r1:
Last 3 lines got in from the commit message template


pkg/cli/tsdump_upload.go line 447 at r1 (raw file):

			fmt.Println("\nPushing failed metrics to datadog logs...")
			for metric := range metricsUploadState.uploadFailedMetrics {
				wg.Add(1)

I have not seen a waitGroup being reused before (i.e calling Add() again after a Wait() call). Have you verified that this works?


pkg/cli/tsdump_upload.go line 449 at r1 (raw file):

				wg.Add(1)
				go func() {
					logMessage := fmt.Sprintf(logMessageFormat, metric)

The metric should be passed as an arg to the anonymous function. Or this will have flaky behaviour. Because the it's being read concurrently and the underlying value can be mutated by the for loop. I am surprised this did not get flagged by the linter 🤔

for metric := range metricsUploadState.uploadFailedMetrics {
    go func(m type) {
        ...
    }(metric)
}

OR I think you can also do this

for metric := range metricsUploadState.uploadFailedMetrics {
    metric := metric // create a locally scoped copy of metric
    go func() {
        ...
    }()
}

pkg/cli/tsdump_upload.go line 474 at r1 (raw file):

			fmt.Println("Failed to pushed some metrics to datadog logs. Please refer CLI output for all failed metrics.")
		} else {
			fmt.Println("Pushing failed metrics to datadog logs...done")

NIT
Something along this line might be more clear

Suggestion:

fmt.Println("Pushing logs of metric upload failures to datadog...done")

@aa-joshi aa-joshi force-pushed the CRDB-44835_ship_tsdump_logs_to_datadog branch from 52db9f4 to 82eeb24 Compare December 11, 2024 13:12
Copy link
Contributor Author

@aa-joshi aa-joshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @angles-n-daemons and @arjunmahishi)


-- commits line 15 at r1:

Previously, arjunmahishi (Arjun Mahishi) wrote…

Last 3 lines got in from the commit message template

Thanks! I missed to remove them.


pkg/cli/tsdump_upload.go line 447 at r1 (raw file):

Previously, arjunmahishi (Arjun Mahishi) wrote…

I have not seen a waitGroup being reused before (i.e calling Add() again after a Wait() call). Have you verified that this works?

Yes, I have verified the behaviour in local testing.


pkg/cli/tsdump_upload.go line 449 at r1 (raw file):

Previously, arjunmahishi (Arjun Mahishi) wrote…

The metric should be passed as an arg to the anonymous function. Or this will have flaky behaviour. Because the it's being read concurrently and the underlying value can be mutated by the for loop. I am surprised this did not get flagged by the linter 🤔

for metric := range metricsUploadState.uploadFailedMetrics {
    go func(m type) {
        ...
    }(metric)
}

OR I think you can also do this

for metric := range metricsUploadState.uploadFailedMetrics {
    metric := metric // create a locally scoped copy of metric
    go func() {
        ...
    }()
}

Interesting! I didn't notice concurrency issue during my testing. The CLI & datadog line count was matching. However, it makes more sense to pass them as parameter. Thanks!


pkg/cli/tsdump_upload.go line 474 at r1 (raw file):

Previously, arjunmahishi (Arjun Mahishi) wrote…

NIT
Something along this line might be more clear

DONE

Copy link
Contributor

@arjunmahishi arjunmahishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @angles-n-daemons)

@aa-joshi aa-joshi force-pushed the CRDB-44835_ship_tsdump_logs_to_datadog branch from 82eeb24 to add416c Compare December 12, 2024 08:47
Previously, we were displaying failed tsdump upload metrics on CLI output. This
was inadequate because CLI output might get truncated in case of high number of
metric failure. To address this, this patch ships failed metrics to datadog as
part of logs so that we can see all failed uploaded metrics as part of logs.

Part of: CRDB-44835
Epic: None
Release note: None
@aa-joshi aa-joshi force-pushed the CRDB-44835_ship_tsdump_logs_to_datadog branch from add416c to 160db6f Compare December 12, 2024 08:57
@aa-joshi
Copy link
Contributor Author

TFTR!

bors r+

@craig craig bot merged commit fada042 into cockroachdb:master Dec 13, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants