traffic is switched before replicaset is fully available when using rollbackWindow
#3941
Comments
I'll be working on a patch which will allow us to delay traffic switching until the "old" replicaset is fully available. I'd be happy to contribute it upstream. From my first "naive" look at the source code, I would add an additional condition after: argo-rollouts/rollout/service.go Lines 269 to 278 in 7938e84
No idea what the best naming for the configuration would be though:
Please make some recommendations here :') |
Is your rollout configured with |
I would actually also check this code path: argo-rollouts/rollout/canary.go Line 379 in 53c4f12
|
No,
Thank you for pointing that out, I'll have a look there as well. |
I wonder if #3878 will solve this issue as well. I'll check it out. |
Describe the bug
We had an incident last Sunday. A team rolled out a new release using the canary strategy provided by Argo Rollouts.
The canary finished successfully and eventually transitioned to stable. Afterwards, the team discovered a bug and decided to roll back the release.
As the previous deployment was within the "rollback window", traffic was switched as soon as a single replica in the "new" replicaset became available. However, the replicaset was still scaling up to match the replica count of the previous replicaset, so it could not handle the load and stopped responding.
The service in question has a fairly long and non-deterministic start-up time, which makes this issue more visible.
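To make the overload concrete, here is a rough sketch of the arithmetic, not taken from the incident itself. The traffic figure, the replica count, and the `perReplicaLoad` helper are all hypothetical; the sketch only assumes that traffic is spread evenly over whichever replicas are currently available.

```go
package main

import "fmt"

// perReplicaLoad returns the requests per second that each available replica
// receives when all traffic is routed to the replicaset and spread evenly.
// It returns -1 when no replica is available, since nothing can serve the load.
// Hypothetical helper for illustration; it is not part of Argo Rollouts.
func perReplicaLoad(totalRPS float64, availableReplicas int) float64 {
	if availableReplicas <= 0 {
		return -1
	}
	return totalRPS / float64(availableReplicas)
}

func main() {
	const totalRPS = 1000.0 // assumed total traffic, made up for this example
	const desired = 10      // assumed replica count of the previous replicaset

	// Compare the load per replica at different points of the scale-up.
	for _, available := range []int{1, 5, desired} {
		fmt.Printf("available=%d/%d -> %.0f rps per replica\n",
			available, desired, perReplicaLoad(totalRPS, available))
	}
}
```

With these made-up numbers, a single available replica receives ten times the load it would see once all ten replicas are up, which matches the failure mode described above.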
To Reproduce
1: Create a Rollout resource using the canary strategy
- spec.rollbackWindow.revisions: 5
- spec.revisionHistoryLimit: 5
- trafficRouting: trafficRouting.traefik.weightedTraefikServiceName: xxx
2: Roll out a new change
Make a change which starts a canary deployment, then wait until it is fully promoted and the old replicaset is scaled down.
3: Rollback the change
Perform a "rollback", or make a modification which matches the old replicaset.
Note:
Expected behavior
I would expect the replicaset to become fully available before traffic is switched back to the "old" replicaset. Or rather, have an option which would allow this behaviour.
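The expected behaviour could be sketched as a simple gate on the replicaset's availability. This is only an illustration under stated assumptions: the `replicaSetStatus` type, the `readyForTrafficSwitch` function, and the `requireFullAvailability` option are hypothetical names, not the Argo Rollouts API, and the "one available replica" branch mirrors the behaviour observed in this report rather than anything confirmed from the source code.

```go
package main

import "fmt"

// replicaSetStatus holds the two counters the check needs. It is a
// simplified, hypothetical stand-in for a real ReplicaSet object.
type replicaSetStatus struct {
	DesiredReplicas   int32
	AvailableReplicas int32
}

// readyForTrafficSwitch reports whether traffic may be switched to the
// replicaset. With requireFullAvailability set (a hypothetical option), every
// desired replica must be available. Without it, one available replica is
// enough, which is the behaviour observed in this report.
func readyForTrafficSwitch(rs replicaSetStatus, requireFullAvailability bool) bool {
	if requireFullAvailability {
		return rs.DesiredReplicas > 0 && rs.AvailableReplicas >= rs.DesiredReplicas
	}
	return rs.AvailableReplicas > 0
}

func main() {
	// A replicaset that is still scaling up: 1 of 10 replicas available.
	scalingUp := replicaSetStatus{DesiredReplicas: 10, AvailableReplicas: 1}

	fmt.Println("switch without the option:", readyForTrafficSwitch(scalingUp, false))
	fmt.Println("switch with the option:   ", readyForTrafficSwitch(scalingUp, true))
}
```

Keeping the stricter check behind an option would leave the current behaviour unchanged for existing users, in line with the request above for an option rather than a changed default.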
Version
v1.7.2
Logs
Logs are from a local environment where the issue was later reproduced.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.