Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-18532: Retry Transient Errors in Flow Engine #5337

Merged
merged 10 commits into from
Jan 4, 2024

Conversation

jujoramos
Copy link
Contributor

@jujoramos jujoramos commented Dec 28, 2023

Instead of automatically publishing the original event back to the
"flow.event" topic whenever a transient exception occurs when executing
the flow pipeline, automatically retry the exceptions using an
exponential backoff retry mechanism and permanently fail the flow if
the configured time or attemps is reached.

  • Create internal retry utility to manage retries in an automated
    fashion with plugable backoff strategy (constant, linear and
    exponential provided out of the box).
  • Replace existing utilities in the crypto components and tests with
    the new one and delete unused classes.
  • Update FlowEventProcessor to automatically retry transient exceptions
    through the new retry utility using an exponential backoff with a
    growth factor of 250ms.
  • Remove unnecessary code and update tests to accommodate for the new
    internal retry.

@corda-jenkins-ci02
Copy link
Contributor

corda-jenkins-ci02 bot commented Dec 28, 2023

Jenkins build for PR 5337 build 21

Build Successful:
Jar artifact version produced by this PR: 5.2.0.0-alpha-1704304261991
Helm chart version produced by this PR: 5.2.0-alpha.1704304261991
Helm chart pushed to: oci://corda-os-docker-dev.software.r3.com/helm-charts/pr-5337/corda

@jujoramos jujoramos changed the title CORE-18532: Retry Transient Flow Exceptions CORE-18532: Retry Transient Errors in Flow Engine Dec 29, 2023
@jujoramos jujoramos marked this pull request as ready for review December 29, 2023 14:34
@jujoramos jujoramos requested review from a team as code owners December 29, 2023 14:34
@corda-jenkins-ci02
Copy link
Contributor

Building E2E Tests on PR-5337

Copy link
Contributor

@dickon dickon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a bunch of comments, which need not blocking merging so if you want to disregard them I can approve anyway.

Copy link
Contributor

@kyriathar kyriathar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments. The crypto bits look good to me in principle.

@driessamyn
Copy link
Contributor

Did you consider using an off-the-shelf library like Resilience4J rather than rolling our own? (not saying we should, but we should evaluate pros/cons).

Also, I thought we'd look at moving away from crypto doing its own bespoke thing and consolidate how we do retries etc with the RPC pattern?

@jujoramos
Copy link
Contributor Author

Did you consider using an off-the-shelf library like Resilience4J rather than rolling our own? (not saying we should, but we should evaluate pros/cons).

I did at the beginning but our use case is quite simplistic at the moment, adding resilience4j felt like killing mosquitos with a cannon... not the most technically valid reason in the world 🤷‍♂️ .

Also, I thought we'd look at moving away from crypto doing its own bespoke thing and consolidate how we do retries etc with the RPC pattern?

Not sure what the background context is here, I've simply created a retry utility to be shared across the code base and refactored the existing crypto retry mechanism to use the new one. Any other work to replace by RPC pattern should likely be its own epic/task, not included within this PR.

@jujoramos jujoramos requested a review from a team as a code owner January 3, 2024 13:43
kyriathar
kyriathar previously approved these changes Jan 3, 2024
@conalsmith-r3
Copy link
Contributor

Also, I thought we'd look at moving away from crypto doing its own bespoke thing and consolidate how we do retries etc with the RPC pattern?

Yeah I don't remember us discussing or agreeing anything. But as I'm working on handling transient errors for external processors the crypto retry logic breaks the mould. If @dickon agrees we could add a story to remove crypto retry handling and fall back on the RPC client's retry logic, the epic is here.

Instead of automatically publishing the original event back to the
"flow.event" topic whenever a transient exception occurs when executing
the flow pipeline, automatically retry the exceptions using an
exponential backoff retry mechanism and permanently fail the flow if
the configured time or attemps is reached.

- Create internal retry utility to manage retries in an automated
  fashion with plugable backoff strategy (constant, linear and
  exponential provided out of the box).
- Replace existing utilities in the crypto components and tests with
  the new one and delete unused classes.
- Update FlowEventProcessor to automatically retry transient exceptions
  through the new retry utility using an exponential backoff with a
  growth factor of 250ms.
- Remove unnecessary code and update tests to accommodate for the new
  internal retry.
kyriathar
kyriathar previously approved these changes Jan 3, 2024
Copy link
Contributor

@kyriathar kyriathar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK (for crypto)

LWogan
LWogan previously approved these changes Jan 3, 2024
Copy link
Contributor

@LWogan LWogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look fine to me.

@jujoramos jujoramos dismissed stale reviews from LWogan and kyriathar via 380af16 January 3, 2024 17:50
Copy link

sonarqubecloud bot commented Jan 3, 2024

Quality Gate Passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

Copy link
Contributor

@dickon dickon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@LWogan LWogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jujoramos jujoramos merged commit bca719e into corda:release/os/5.2 Jan 4, 2024
4 checks passed
@jujoramos jujoramos deleted the feature/CORE-18532 branch January 4, 2024 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants