Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add new retry automated tests #1505

Merged
merged 13 commits into from
Jan 21, 2025
Merged

feat: add new retry automated tests #1505

merged 13 commits into from
Jan 21, 2025

Conversation

rishtigupta
Copy link
Contributor

@rishtigupta rishtigupta commented Jan 15, 2025

PR Description:

  • Add new automated tests when there is full network outage vs when there is temporary network outage

Testing

1. Full-network outage:
Run momento-local on this branch using the following params:

cargo run -- --port 8080 --hostname 0.0.0.0 --return-error unavailable --error-rpcs get

and
Run the test using the following command:

npm run integration-test-retry-full-network-outage

2. Temporary-network outage:
Run momento-local on this branch using the following params:

cargo run -- --port 8080 --hostname 0.0.0.0 --return-error unavailable --error-rpcs get --error-count 2

and
Run the test using the following command:

npm run integration-test-retry-temporary-network-outage

Issue

closes https://github.com/momentohq/dev-eco-issue-tracker/issues/1093

@rishtigupta rishtigupta marked this pull request as ready for review January 15, 2025 23:49
@pgautier404
Copy link
Contributor

It looks like the test middleware is still being referenced explicitly in the cache data client. This really needs to be resolved before we can merge this PR.

Comment on lines 32 to 38
test-full-network-outage-retry-service:
@echo "Testing full network outage retry service..."
@npm run integration-test-retry-full-network-outage

test-temporary-network-outage-retry-service:
@echo "Testing temporary network outage retry service..."
@npm run integration-test-retry-temporary-network-outage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need separate targets for these? Also the -service suffix is to distinguish services in the canaries, eg leaderboard service vs cache vs topics. You can omit that unless this is something we're going to carry over to the canaries.

Comment on lines +26 to +27
"integration-test-retry-full-network-outage": "jest retry/automated-retry-full-network-outage.test.ts --maxWorkers 1 -- useConsistentReads",
"integration-test-retry-temporary-network-outage": "jest retry/automated-retry-temporary-network-outage.test.ts --maxWorkers 1 -- useConsistentReads",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to pull these out vs include as part of the respective services?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can include these as part of cache service in future. As of now, the docker image for momento-local is not published limiting us to include these from running in the ci and hence the cache service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I can get rid of the scripts from the package.json to eliminate the confusion.

Copy link
Contributor

@anitarua anitarua Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests will always be run separately from the other integration tests that run against the live service though. I think separate test targets makes sense so they don't accidentally run as part of the canaries or something, but we will eventually run them in github ci

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were planning on running our integration tests using momento local too in future?

Copy link
Contributor

@anitarua anitarua Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running the existing cache integration tests against momento-local will be a separate change the team can decide on later. And I think even then there will be some component of CI/CD that runs them against alpha.

But for the automated retry tests, they will only ever run against momento-local in CI, and in that case I think it would be useful to have separate Makefile and/or package.json test targets for starting them.

Example github actions workflows:

// on push to main, against alpha
make test-cache-service

// on pull request
make test-cache-service (could be against momento-local or alpha)
make test-retries-full-outage (against momento-local WITHOUT error count CLI arg)
make test-retries-temporary-outage (against momento-local WITH error count CLI arg)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I am adding the targets back. We can discuss this synchronously if needed?

@@ -35,7 +35,7 @@ export class FixedCountRetryStrategy implements RetryStrategy {
// null means do not retry
return null;
}
if (props.attemptNumber > this.maxAttempts) {
if (props.attemptNumber >= this.maxAttempts) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was maxAttempts intended to include the initial request?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me and Anita tried running the tests such that max retries are 3, but it always retried 4 times which was confusing.

Copy link
Contributor

@anitarua anitarua Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think initial request is included in retry count. I have some sample output from testing a toy program against momento-local which delayed get requests for 500ms before returning an error, and I observed ~2500ms total indicating 5 get requests, the initial one + the observed 4 (when it should've been an observed 3 retries).

logs copied here for reference

Client debug logs from above example:

Key stored successfully!
Time taken for set: 3ms
[2025-01-14T18:11:54.342Z] DEBUG (Momento: FixedCountRetryStrategy): Determining whether request is eligible for retry; status code: 14, request type: /cache_client.Scs/Get, attemptNumber: 0, maxAttempts: 3
[2025-01-14T18:11:54.342Z] DEBUG (Momento: FixedCountRetryStrategy): Request is eligible for retry (attempt 0 of 3, retrying immediately.
[2025-01-14T18:11:54.342Z] DEBUG (Momento: RetryInterceptor): Request eligible for retry: path: /cache_client.Scs/Get; response status code: 14; number of attempts (1); will retry in 0ms
[2025-01-14T18:11:54.344Z] DEBUG (Momento: RetryInterceptor): Retrying request: path: /cache_client.Scs/Get; deadline was: 2025-01-14T18:11:58.838Z, overall deadline is: 2025-01-14T18:11:58.838Z
[2025-01-14T18:11:54.344Z] DEBUG (Momento: RetryInterceptor): Setting next deadline (via offset of 5000 ms) to: 2025-01-14T18:11:59.344Z
[2025-01-14T18:11:54.848Z] DEBUG (Momento: FixedCountRetryStrategy): Determining whether request is eligible for retry; status code: 14, request type: /cache_client.Scs/Get, attemptNumber: 1, maxAttempts: 3
[2025-01-14T18:11:54.848Z] DEBUG (Momento: FixedCountRetryStrategy): Request is eligible for retry (attempt 1 of 3, retrying immediately.
[2025-01-14T18:11:54.848Z] DEBUG (Momento: RetryInterceptor): Request eligible for retry: path: /cache_client.Scs/Get; response status code: 14; number of attempts (2); will retry in 0ms
[2025-01-14T18:11:54.850Z] DEBUG (Momento: RetryInterceptor): Retrying request: path: /cache_client.Scs/Get; deadline was: 2025-01-14T18:11:59.344Z, overall deadline is: 2025-01-14T18:11:58.838Z
[2025-01-14T18:11:54.850Z] DEBUG (Momento: RetryInterceptor): Setting next deadline (via offset of 5000 ms) to: 2025-01-14T18:11:59.850Z
[2025-01-14T18:11:55.358Z] DEBUG (Momento: FixedCountRetryStrategy): Determining whether request is eligible for retry; status code: 14, request type: /cache_client.Scs/Get, attemptNumber: 2, maxAttempts: 3
[2025-01-14T18:11:55.358Z] DEBUG (Momento: FixedCountRetryStrategy): Request is eligible for retry (attempt 2 of 3, retrying immediately.
[2025-01-14T18:11:55.358Z] DEBUG (Momento: RetryInterceptor): Request eligible for retry: path: /cache_client.Scs/Get; response status code: 14; number of attempts (3); will retry in 0ms
[2025-01-14T18:11:55.359Z] DEBUG (Momento: RetryInterceptor): Retrying request: path: /cache_client.Scs/Get; deadline was: 2025-01-14T18:11:59.850Z, overall deadline is: 2025-01-14T18:11:58.838Z
[2025-01-14T18:11:55.359Z] DEBUG (Momento: RetryInterceptor): Setting next deadline (via offset of 5000 ms) to: 2025-01-14T18:12:00.359Z
[2025-01-14T18:11:55.868Z] DEBUG (Momento: FixedCountRetryStrategy): Determining whether request is eligible for retry; status code: 14, request type: /cache_client.Scs/Get, attemptNumber: 3, maxAttempts: 3
[2025-01-14T18:11:55.868Z] DEBUG (Momento: FixedCountRetryStrategy): Request is eligible for retry (attempt 3 of 3, retrying immediately.
[2025-01-14T18:11:55.868Z] DEBUG (Momento: RetryInterceptor): Request eligible for retry: path: /cache_client.Scs/Get; response status code: 14; number of attempts (4); will retry in 0ms
[2025-01-14T18:11:55.869Z] DEBUG (Momento: RetryInterceptor): Retrying request: path: /cache_client.Scs/Get; deadline was: 2025-01-14T18:12:00.359Z, overall deadline is: 2025-01-14T18:11:58.838Z
[2025-01-14T18:11:55.869Z] DEBUG (Momento: RetryInterceptor): Setting next deadline (via offset of 5000 ms) to: 2025-01-14T18:12:00.869Z
[2025-01-14T18:11:56.376Z] DEBUG (Momento: FixedCountRetryStrategy): Determining whether request is eligible for retry; status code: 14, request type: /cache_client.Scs/Get, attemptNumber: 4, maxAttempts: 3
[2025-01-14T18:11:56.376Z] DEBUG (Momento: FixedCountRetryStrategy): Exceeded max attempt count (3)
[2025-01-14T18:11:56.376Z] DEBUG (Momento: RetryInterceptor): Request not eligible for retry: path: /cache_client.Scs/Get; retryable status code: 14; number of attempts (4).
Error: The server was unable to handle the request; consider retrying.  If the error persists, please contact us at [email protected]: 14 UNAVAILABLE: Unavailable
Time taken for get: 2542ms
Finished example!

Copy link
Contributor

@anitarua anitarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good overall! just a couple of requests

Copy link
Contributor

@anitarua anitarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

would be good to get @pgautier404 and/or @malandis to sign off on the shouldLoadLate bool changes too though

Copy link
Contributor

@pgautier404 pgautier404 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks! I mostly concentrated on the middleware loading portion of the code, so I'll defer to the other reviewers on the test setup and the tests themselves.

@rishtigupta rishtigupta merged commit 04e57f9 into main Jan 21, 2025
14 checks passed
@rishtigupta rishtigupta deleted the feat/new-retry-tests branch January 21, 2025 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants