
profiler: test random execution trace collection with a fixed seed #2642

Closed
wants to merge 1 commit into main from nick.ripley/deflake-profiler-random

Conversation

Contributor

@nsrip-dd nsrip-dd commented Apr 1, 2024

To de-flake TestExecutionTraceRandom, provide a fixed-seed random number
generator so that the results are deterministic. This is done through a
non-exported profiler option so it's easy to provide in specific test
cases (only one so far). Developers should remove this option while
working on anything that might rely on real randomness, verify that it
works as intended, and then add the option back to get reliable tests in
CI.

Fixes #2529

@nsrip-dd nsrip-dd requested a review from a team as a code owner April 1, 2024 14:52

pr-commenter bot commented Apr 1, 2024

Benchmarks

Benchmark execution time: 2024-04-01 15:14:16

Comparing candidate commit 7dc4cf4 in PR branch nick.ripley/deflake-profiler-random with baseline commit 6b4d009 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 44 metrics, 0 unstable metrics.

Member

@felixge felixge left a comment


Thanks for working on this! I'm okay with this approach.

But maybe there is a simpler fix. While looking at this PR, I was trying to figure out why we're seeing these flakes to begin with. According to the comment below, we'd be expecting to see 1 flake every 100k runs.

// We should be within 2 standard deviations ~95% of the time
// with a correct implementation. If we do this twice, then
// we have a ~99.999% chance of succeeding with a correct
// implementation.

However, according to your comment here, we're seeing something closer to one failure every 500 runs in reality.

I was initially worried we might be seeing a bad RNG or something, but thinking about it a little more, I think the assumption is wrong. If each doTrial() has a 5% chance of failure, then the chance that two attempts fail should be 0.05*0.05 which is 1 in 400 (aka 99.75% success chance).

That number lines up pretty well with what we seem to observe in the wild, so maybe another way to fix it is to simply go from 2 to 4 trials? That would give us 1:160k odds, which is what the test was initially expecting.

Let me know what you think 🙇

nsrip-dd added a commit that referenced this pull request Apr 8, 2024
Due to a bit of math sloppiness, we were getting a ~1/500 failure rate
for TestExecutionTraceRandom, which was often enough to be irritating to
dd-trace-go developers. Each trial has a 95% success rate given a
correct implementation. We were doing 2 trials. The comment in the test
incorrectly states that 2 trials should have a 99.999% success rate.
But actually we should expect a ~99.75% success rate for 2 trials, or a
1/400 failure rate, roughly matching what we saw.

Increase the number of trials to 4. This actually gives the desired
99.999% success rate. We should expect roughly 1 failure for every
160000 runs. This is a tolerable failure rate, and lets the test remain
somewhat robust, rather than use a fixed seed as considered in #2642.
I have manually tested this by breaking the implementation (multiplying
by an extra rand.Float64() draw) and confirmed that the test still fails
reliably.

Fixes #2529
Contributor Author

nsrip-dd commented Apr 8, 2024

I think the assumption is wrong. If each doTrial() has a 5% chance of failure, then the chance that two attempts fail should be 0.05*0.05 which is 1 in 400 (aka 99.75% success chance).

🤦 Indeed, thank you very much for double checking this! I had just opted to fix the seed to make the flake go away, but I should have revisited the numbers. I have no idea where the 99.999% came from, looking back. Your analysis is correct. I've also gone back and verified the probabilities to confirm that the "be within 2 standard deviations" probabilities are ~95% for the rates we test:

> rates <- c(1/15, 0.5, 1)
> means <- 100*rates
> stddevs <- sqrt(100*rates*(1-rates))
> lo <- means - 2*stddevs
> hi <- means + 2*stddevs
> 1 - (1 - pbinom(hi, size=100, prob=rates) + pbinom(lo, size=100, prob=rates))
[1] 0.9574683 0.9539559 0.0000000

(here, pbinom is the cumulative distribution function, and I'm computing the probability around the mean by getting the size of the tail ends and subtracting them. The 0 for probability 1 is just because lo and hi are the same, but really we're testing that probability 1 gets exactly 100 each time)

I've sent #2651 implementing your suggestion and will close this one. I think keeping some randomness will make the test more robust. If we just kept a fixed seed then I worry we might break the implementation but still have it coincidentally work for the fixed seed.

@nsrip-dd nsrip-dd closed this Apr 8, 2024
nsrip-dd added a commit that referenced this pull request Apr 11, 2024
…als (#2651)

Successfully merging this pull request may close these issues.

[Flaky test] profiler: TestExecutionTraceRandom flake