
profiler: test random execution trace collection with a fixed seed #2642

Closed
wants to merge 1 commit into main from nick.ripley/deflake-profiler-random

Conversation

Contributor

@nsrip-dd nsrip-dd commented Apr 1, 2024

To de-flake TestExecutionTraceRandom, provide a fixed-seed random number
generator so that the results are deterministic. This is done through a
non-exported profiler option so it's easy to provide in specific test
cases (only one so far). Developers should remove this option while
working on anything that might rely on real randomness, verify that it
works as intended, and then add the option back to get reliable tests in
CI.

Fixes #2529

@nsrip-dd nsrip-dd requested a review from a team as a code owner April 1, 2024 14:52

pr-commenter bot commented Apr 1, 2024

Benchmarks

Benchmark execution time: 2024-04-01 15:14:16

Comparing candidate commit 7dc4cf4 in PR branch nick.ripley/deflake-profiler-random with baseline commit 6b4d009 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 44 metrics, 0 unstable metrics.

Member

@felixge felixge left a comment


Thanks for working on this! I'm okay with this approach.

But maybe there is a simpler fix. While looking at this PR, I was trying to figure out why we're seeing these flakes to begin with. According to the comment below, we'd be expecting to see 1 flake every 100k runs.

// We should be within 2 standard deviations ~95% of the time
// with a correct implementation. If we do this twice, then
// we have a ~99.999% chance of succeeding with a correct
// implementation.

However, according to your comment here, we're seeing something closer to one failure every 500 runs in reality.

I was initially worried we might be seeing a bad RNG or something, but thinking about it a little more, I think the assumption is wrong. If each doTrial() has a 5% chance of failure, then the chance that two attempts fail should be 0.05*0.05 which is 1 in 400 (aka 99.75% success chance).

That number lines up pretty well with what we seem to observe in the wild, so maybe another way to fix it is to simply go from 2 to 4 trials? That would give us 1:160k odds, which is what the test was initially expecting.

Let me know what you think 🙇

nsrip-dd added a commit that referenced this pull request Apr 8, 2024
Due to a bit of math sloppiness, we were getting a ~1/500 failure rate
for TestExecutionTraceRandom, which was often enough to be irritating to
dd-trace-go developers. Each trial has a 95% success rate given a
correct implementation. We were doing 2 trials. The comment in the test
incorrectly states that 2 trials should have a 99.999% success rate.
But actually we should expect a ~99.75% success rate for 2 trials, or a
1/400 failure rate, roughly matching what we saw.

Increase the number of trials to 4. This actually gives the desired
99.999% success rate. We should expect roughly 1 failure for every
160000 runs. This is a tolerable failure rate, and lets the test remain
somewhat robust, rather than use a fixed seed as considered in #2642.
I have manually tested this by breaking the implementation (multiplying
by an extra rand.Float64() draw) and confirmed that the test still fails
reliably.

Fixes #2529
Contributor Author

nsrip-dd commented Apr 8, 2024

I think the assumption is wrong. If each doTrial() has a 5% chance of failure, then the chance that two attempts fail should be 0.05*0.05 which is 1 in 400 (aka 99.75% success chance).

🤦 Indeed, thank you very much for double checking this! I had just opted to fix the seed to make the flake go away, but I should have revisited the numbers. I have no idea where the 99.999% came from, looking back. Your analysis is correct. I've also gone back and verified the probabilities to confirm that the "be within 2 standard deviations" probabilities are ~95% for the rates we test:

> rates <- c(1/15, 0.5, 1)
> means <- 100*rates
> stddevs <- sqrt(100*rates*(1-rates))
> lo <- means - 2*stddevs
> hi <- means + 2*stddevs
> 1 - (1 - pbinom(hi, size=100, prob=rates) + pbinom(lo, size=100, prob=rates))
[1] 0.9574683 0.9539559 0.0000000

(here, pbinom is the cumulative distribution function, and I'm computing the probability around the mean by getting the size of the tail ends and subtracting them. The 0 for probability 1 is just because lo and hi are the same, but really we're testing that probability 1 gets exactly 100 each time)

I've sent #2651 implementing your suggestion and will close this one. I think keeping some randomness will make the test more robust. If we just kept a fixed seed then I worry we might break the implementation but still have it coincidentally work for the fixed seed.

@nsrip-dd nsrip-dd closed this Apr 8, 2024
nsrip-dd added a commit that referenced this pull request Apr 11, 2024
…als (#2651)

Successfully merging this pull request may close these issues.

[Flaky test] profiler: TestExecutionTraceRandom flake