Attempt to dodge flakiness in `heavy_tasks_doesnt_block_graphql` test #2437

rafal-ch · 2024-11-14T10:53:42Z

Description

This is an attempt to resolve observed flakiness in the heavy_tasks_doesnt_block_graphql test.

When I tested the fix locally, with much smaller (250ms) timeouts and debug prints, I could observe the following outcome:

running 1 test
ERR
ERR
OK
test dos::heavy_tasks_doesnt_block_graphql ... ok

Changes:

1. Allow retrying the "health" check request 3 times.
2. Because we now allow 3 tries instead of just 1:
   3. Reduce the timeout from 5 to 4 seconds.
   4. Spam the node with 3 times as many requests in the background (50 -> 150).

This is an effort to make the test more resilient to the performance of the machine it is executed on.
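
The retry idea from the list above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code from this PR: the helper names `check_with_timeout` and `retry_health_check`, the thread-based timeout, and the simulated slow check are all assumptions made for the example. The real test is presumably async and talks to a running node.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: runs `check` on a helper thread and waits at most
/// `timeout` for its result. Returns `true` only if the check finished in
/// time and reported success.
fn check_with_timeout<C>(check: C, timeout: Duration) -> bool
where
    C: FnOnce() -> bool + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If we already timed out, the receiver is gone; ignore that error.
        let _ = tx.send(check());
    });
    matches!(rx.recv_timeout(timeout), Ok(true))
}

/// Hypothetical helper: tries the health check up to `max_attempts` times and
/// returns the 1-based attempt that succeeded, or `None` if every try failed.
fn retry_health_check<F, C>(
    mut make_check: F,
    max_attempts: usize,
    timeout: Duration,
) -> Option<usize>
where
    F: FnMut(usize) -> C,
    C: FnOnce() -> bool + Send + 'static,
{
    for attempt in 1..=max_attempts {
        if check_with_timeout(make_check(attempt), timeout) {
            println!("OK");
            return Some(attempt);
        }
        println!("ERR");
    }
    None
}

fn main() {
    // Simulated node: the first two checks take longer than the timeout.
    let result = retry_health_check(
        |attempt| {
            move || {
                if attempt < 3 {
                    thread::sleep(Duration::from_millis(1000));
                }
                true
            }
        },
        3,
        Duration::from_millis(250),
    );
    assert_eq!(result, Some(3));
}
```

With the simulated delays above, the first two attempts time out and the third succeeds, which mirrors the ERR / ERR / OK pattern in the local run. Timed-out checks are simply abandoned on their helper threads in this sketch; an async implementation would cancel them instead.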

Before requesting review

I have reviewed the code myself

netrome

I'm a bit hesitant about having retry logic here. Isn't the point of the test to ensure that the service is available despite the load? Adding retry logic seems like it defeats the purpose of the test.

What is the reason the health query can time out, other than the heavy tasks actually blocking the queries? I'd take it that if this test still fails occasionally in CI, we'd need to look into whether we can do further improvements in reducing the risk of the heavy tasks blocking the other requests.

rafal-ch · 2024-11-15T09:55:14Z

> I'm a bit hesitant about having retry logic here. Isn't the point of the test to ensure that the service is available despite the load? Adding retry logic seems like it defeats the purpose of the test.

The reason I did it this way is that we do not give strict guarantees about the service's responsiveness under load. One could ask why we allowed exactly 5 seconds in the original test implementation. Does it mean that we should always respond within 5 seconds, regardless of the load and the machine we're running on? Probably not. I figured that having a couple of short retries would be better than just increasing the timeout arbitrarily.

> we can do further improvements in reducing the risk of the heavy tasks blocking the other requests

The question here is: reduce to what level? No matter what improvements you make, you'll always observe some randomness in response time, depending on the load of the machine, etc.

Anyway, I'm happy to bench and work on potential improvements in this regard if we see the need, especially if this test keeps failing even with the retries, which would mean we have a deeper issue. For this PR though, the goal is to make CI runs less flaky.

netrome · 2024-11-15T12:01:50Z

Fair enough. I guess part of the problem here is that the original test is inherently flaky, and that the responsiveness behavior we're testing isn't well defined. I'd question if we should even keep the test or just disable it/remove it. But since this actually makes our CI behave better I'll approve.

xgreenx

> One could ask why we allowed exactly 5 seconds in the original test implementation. Does it mean that we should always respond within 5 seconds, regardless of the load and the machine we're running on? Probably not.

The machine should answer in 5 seconds regardless of its load. Otherwise, the liveness check will fail.

Maybe we should consider running this test without other tests in parallel to avoid thread-blocking tasks or having too many threads.

Also, maybe we just need to increase `number_of_threads` in the config.
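
One way to keep a heavy test from overlapping with others in the same test binary is a shared lock, sketched below. This is only an illustration of the idea and not something proposed in this PR: `HEAVY_TEST_LOCK`, `exclusive`, and `peak_concurrency` are hypothetical names, and the worker threads stand in for tests that the harness would otherwise run in parallel.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, MutexGuard, PoisonError};
use std::thread;
use std::time::Duration;

/// Hypothetical process-wide lock that every "heavy" test would take first.
static HEAVY_TEST_LOCK: Mutex<()> = Mutex::new(());

/// Acquires the lock. A panic in another holder poisons the mutex, but the
/// guard is still usable, so the poison error is unwrapped instead of failing.
fn exclusive() -> MutexGuard<'static, ()> {
    HEAVY_TEST_LOCK
        .lock()
        .unwrap_or_else(PoisonError::into_inner)
}

/// Runs `workers` simulated tests on separate threads and returns the peak
/// number that were inside the guarded section at the same time.
fn peak_concurrency(workers: usize) -> usize {
    let active = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let (active, peak) = (Arc::clone(&active), Arc::clone(&peak));
            thread::spawn(move || {
                let _guard = exclusive();
                let now = active.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                // Stand-in for the body of a heavy test.
                thread::sleep(Duration::from_millis(10));
                active.fetch_sub(1, Ordering::SeqCst);
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    // With the lock in place, the guarded sections never overlap.
    assert_eq!(peak_concurrency(4), 1);
}
```

A lock like this only serializes the tests that take it, and only within one test binary. Running the whole suite with `cargo test -- --test-threads=1` is the blunter alternative, at the cost of slower runs.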

xgreenx

@rafal-ch Could you check whether you still have the problem after #2401?

rafal-ch · 2024-11-18T14:16:47Z

> @rafal-ch Could you check whether you still have the problem after #2401?

With the original 5 sec timeout I was never able to reproduce the problem locally (neither with #2401 nor without it). It was happening on CI only.

When I reduce the timeout the test is still flaky on my local machine.

In short: no change in the behavior observed.

AurelienFT · 2024-12-02T10:55:51Z

I haven't seen this test fail in our recent CI runs. Can we close the PR but keep the branch, and reopen it if we see the test fail again? @rafal-ch

rafal-ch · 2024-12-02T10:56:59Z

> I haven't seen this test fail in our recent CI runs. Can we close the PR but keep the branch, and reopen it if we see the test fail again? @rafal-ch

Yeah, closing for now.

Attempt to dodge flakiness in heavy_tasks_doesnt_block_graphql test

e87165e

rafal-ch requested review from xgreenx, Dentosal and MitchTurner as code owners November 14, 2024 10:53

rafal-ch added the "no changelog" label (Skip the CI check of the changelog modification) Nov 14, 2024

rafal-ch requested a review from a team November 14, 2024 16:38

netrome reviewed Nov 15, 2024

netrome approved these changes Nov 15, 2024

xgreenx reviewed Nov 15, 2024

Merge branch 'master' into rafal_2345_flaxy_test_attempted_fix

4cca5ba

xgreenx reviewed Nov 18, 2024

rafal-ch closed this Dec 2, 2024
