Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SM dev - state machine #370

Merged
merged 71 commits into from
Jun 16, 2022

Conversation

bowenlan-amzn
Copy link
Member

@bowenlan-amzn bowenlan-amzn commented May 19, 2022

Issue #, if available:
#280

Description of changes:

This PR add the core logic for Snapshot Management (SM), which is implemented as a state machine.

sm_statemachine

Simplified to below now:

image

All the operations of snapshot management are placed in different SMState. After one SMState finishes all its operations, it returns the Next SMResult to inform the SMStateMachine (Context object) to save the finished SMState as currentState, so that the next scheduled job would execute the next SMState based on transitions. On the other hand, if the SMState cannot finish all operations (e.g. check if current time passes the next execution period [1]), it will return Stay SMResult to inform the SMStateMachine to save previous state as the currentState, so that the next scheduled job [2] executes the same SMState again.

Snapshot management has 2 workflows for now: creation and deletion. These 2 workflows are executed sequentially in one scheduled job run. Please refer to the SMStateMachine.next part in SMRunner. So these 2 workflows run in parallel from user's perspective, with their own set of metadata.

If your snapshot creation time is longer than the execution period, the execution period would be skipped. But if the time_limit parameter is provided, then if the creation time has passed the time_limit, state machine directly goes to next execution period w/o waiting for the previous creation to be finished.

Failure handling: this state machine won't fail and stop, it would reset the workflow (set the related metadata of this workflow to null) after 3 retries and go to the next execution period. It can also notify the user once notification is integrated.

Glossary:

  1. execution period: the period defined by user through SM policy. e.g. create a snapshot every day at 8pm, one execution period is [5/25 20:00 - 5/26 19:59]. In the future, there may be other type of execution period, e.g., index size increase - one execution period could be [10gb - 19.9gb].
  2. scheduled job: SM logic is run periodically (every 1m) by job scheduler runner. Each run is a scheduled job.

Q&A

  • Why not using 2 scheduled jobs for creation and deletion workflows?

    The complexity or risk of doing it in single job is less:

    • with one job to run 2 workflows, we need to return which workflow in SMResult so the SMStateMachine can operate on the metadata fields of this workflow.
    • with 2 jobs for 2 workflows, we probably need 2 metadatas, because if 2 jobs both updating one metadata, there could be race condition.

CheckList:

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov-commenter
Copy link

codecov-commenter commented May 21, 2022

Codecov Report

Merging #370 (312c8e8) into sm-dev (fca8469) will increase coverage by 0.53%.
The diff coverage is 81.16%.

@@             Coverage Diff              @@
##             sm-dev     #370      +/-   ##
============================================
+ Coverage     75.31%   75.85%   +0.53%     
- Complexity     2345     2428      +83     
============================================
  Files           298      309      +11     
  Lines         13517    14179     +662     
  Branches       2072     2178     +106     
============================================
+ Hits          10181    10755     +574     
- Misses         2190     2222      +32     
- Partials       1146     1202      +56     
Impacted Files Coverage Δ
...pensearch/indexmanagement/IndexManagementPlugin.kt 89.73% <ø> (-0.04%) ⬇️
...pensearch/indexmanagement/IndexManagementRunner.kt 55.55% <ø> (ø)
.../snapshotmanagement/SnapshotManagementException.kt 60.86% <60.86%> (ø)
...xmanagement/snapshotmanagement/model/SMMetadata.kt 74.25% <72.96%> (+6.37%) ⬆️
...dexmanagement/snapshotmanagement/model/SMPolicy.kt 81.99% <75.55%> (+0.43%) ⬆️
...gement/snapshotmanagement/engine/SMStateMachine.kt 77.90% <77.90%> (ø)
...nt/engine/states/deletion/DeletionFinishedState.kt 78.72% <78.72%> (ø)
...arch/indexmanagement/snapshotmanagement/SMUtils.kt 75.15% <79.04%> (+16.23%) ⬆️
...ement/engine/states/deletion/DeletionStartState.kt 80.00% <80.00%> (ø)
...t/api/transport/explain/ExplainSMPolicyResponse.kt 85.18% <83.33%> (-2.32%) ⬇️
... and 33 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fca8469...312c8e8. Read the comment docs.

Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
bowenlan-amzn and others added 8 commits May 23, 2022 11:29
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
@rishabhmaurya
Copy link
Contributor

Few high level question -

  1. What happens if the snapshot repo gets unregistered and registered back? Is it failsafe in such circumstances.
  2. If cluster gets restarted and cluster state is lost somehow, but we have the repo present in S3. Can it recover the SLM metadata state in such scenarios?

Also, I didn't get chance to look at the UTs, I'm not if we have a way here to post the overall coverage impact with this change.

@bowenlan-amzn
Copy link
Member Author

Few high level question -

1. What happens if the snapshot repo gets unregistered and registered back? Is it failsafe in such circumstances.

2. If cluster gets restarted and cluster state is lost somehow, but we have the repo present in S3. Can it recover the SLM metadata state in such scenarios?

Also, I didn't get chance to look at the UTs, I'm not if we have a way here to post the overall coverage impact with this change.

  1. Repository needs to be valid when automated creation/deletion happened, if repo is missing, we retry 3 times and skip to next execution period. User should be able to see the repo missing exception in explain API result or get notified.
  2. Metadata is saved in ISM system index, if you can restore the system index back from s3, then this should still be able to continue executing.
  3. Coverage is showing here by CodeCov SM dev - state machine #370 (comment)

Signed-off-by: bowenlan-amzn <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>
@bowenlan-amzn
Copy link
Member Author

Left un-resolved comments will be addressed in the future snapshot management PRs due to development time conflicts.

@rishabhmaurya
Copy link
Contributor

Left un-resolved comments will be addressed in the future snapshot management PRs due to development time conflicts.

thanks! please compile a list of important ones to fix from here, before release.

@downsrob
Copy link
Contributor

Agreed, would be great if we could start generating a master list of what we want to do before release and important enhancements for post release if we don't have time!

@bowenlan-amzn bowenlan-amzn merged commit 56bb6bd into opensearch-project:sm-dev Jun 16, 2022
@bowenlan-amzn bowenlan-amzn deleted the sm-statemachine branch June 21, 2022 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants