fix: refactor retries #3060
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3060      +/-   ##
==========================================
+ Coverage   51.10%   51.22%   +0.11%
==========================================
  Files         283      283
  Lines       25410    25456      +46
==========================================
+ Hits        12987    13039      +52
+ Misses      11724    11720       -4
+ Partials      699      697       -2
First sweep with a few minor nits. Overall, it looks absolutely great.
Needs DCO and linter fixes, but otherwise LGTM. Great work @krancour 🙇
Definitely following this. Showing that a step took "x" duration, or has been running for "y" duration, will add value.
Actually, I think this also needs a slight documentation update.
@hiddeco I mentioned docs will be in a follow-up. I'll fix DCO and linter errors. This is what I get for editing directly on GitHub on my phone!
Fixes #3052
Never fear... a lot of this is just codegen.
None of this is breaking because the bits that are refactored haven't been released yet. 😄
@hiddeco this follows the plan you and I discussed offline earlier.
You also get your wish of us tracking start and end time for each step. I bet @Marvin9 could make good use of this information in the future.
I did my very best to preserve the general spirit and structure of #2940.
With this change, every step fails on its first error with no retries by default, but you can make a step retry after an error by explicitly configuring an error threshold > 1.
As long as a step says it is running, it will be retried ad infinitum by default. The argocd-update step is the exception, with a lower default timeout of 10m. You can give any step a finite timeout by explicitly configuring a non-nil and non-zero duration.
As a bonus, I found and fixed two other bugs related to step execution:
The step execution engine was only ever returning health checks on success. This meant that if, in a single run, step 1 succeeded and had health checks to return, but step 2 was running and needed to be requeued, the health checks step 1 wanted to register never made it back to the reconciler.
The reconciler was only ever doing anything with the health checks on promotion success. A single promotion succeeds a maximum of once. This meant that even if the previous bug hadn't existed, the reconciler would ignore all health checks except those returned from its final call out to the engine. Other steps could have succeeded in previous calls and their health checks would be nowhere to be found.
We never happened to notice this because the only step that currently registers health checks is argocd-update, and that tends to be the last step in a promo process, so things have always worked out in its favor despite these bugs.
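The corrected behavior for both bugs can be sketched as follows. This is an illustration only, not the engine or reconciler code from this PR: `Step`, `Result`, `Promotion`, `Execute`, and `Reconcile` are hypothetical names. The engine returns health checks from every step that succeeded in the current run even when a later step is still running, and the reconciler records them on every call rather than only on final success.

```go
package main

import "fmt"

// Status is a hypothetical step or promotion status.
type Status string

const (
	Succeeded Status = "Succeeded"
	Running   Status = "Running"
)

// HealthCheck is a hypothetical health check a step wants registered.
type HealthCheck struct{ Step string }

// Step is a hypothetical step: it reports a status and, optionally,
// a health check to register.
type Step func() (Status, *HealthCheck)

// Result is what the hypothetical engine returns from one run.
type Result struct {
	Status       Status
	NextStep     int
	HealthChecks []HealthCheck
}

// Execute runs steps starting at index from. Health checks collected from
// steps that succeeded are returned even if a later step is not finished.
func Execute(steps []Step, from int) Result {
	res := Result{Status: Succeeded, NextStep: from}
	for i := from; i < len(steps); i++ {
		status, hc := steps[i]()
		if status != Succeeded {
			res.Status = status
			res.NextStep = i // resume at this step on the next run
			return res
		}
		if hc != nil {
			res.HealthChecks = append(res.HealthChecks, *hc)
		}
		res.NextStep = i + 1
	}
	return res
}

// Promotion is the hypothetical state persisted between reconciliations.
type Promotion struct {
	CurrentStep  int
	HealthChecks []HealthCheck
	Done         bool
}

// Reconcile records health checks from every engine call, not only the
// final successful one.
func Reconcile(p *Promotion, steps []Step) {
	res := Execute(steps, p.CurrentStep)
	p.CurrentStep = res.NextStep
	p.HealthChecks = append(p.HealthChecks, res.HealthChecks...)
	p.Done = res.Status == Succeeded
}

func main() {
	calls := 0
	steps := []Step{
		// Step 1 succeeds immediately and registers a health check.
		func() (Status, *HealthCheck) { return Succeeded, &HealthCheck{Step: "step-1"} },
		// Step 2 is still running on its first attempt.
		func() (Status, *HealthCheck) {
			calls++
			if calls < 2 {
				return Running, nil
			}
			return Succeeded, nil
		},
	}
	p := &Promotion{}
	Reconcile(p, steps) // step 1 succeeds, step 2 is requeued
	Reconcile(p, steps) // step 2 succeeds
	fmt.Println(p.Done, p.HealthChecks)
}
```

In this sketch the health check from step 1 is recorded during the first reconciliation, so it survives even though the promotion only succeeds on the second.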
This still needs corresponding doc updates, but I will tackle them in a follow-up.