Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-16312 control: Always use --force for dmg system stop #15799

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

tanabarr
Copy link
Contributor

@tanabarr tanabarr commented Jan 27, 2025

Whenever stopping an engine process from within the control-plane, use
SIGKILL rather than asking nicely (SIGTERM). This has been requested
to try to avoid situations that could result in dataloss.

This change preserves the behaviour where ds_mgmt_drpc_prep_shutdown()
and then ds_pool_disable_exclude() will be called during a controlled
shutdown where dmg system stop is called with new --full argument.

Notable behavior changes with this PR:

  • Always performs SIGKILL on dmg system stop unless --full command
    option is supplied.
  • Will attempt prepare shutdown to disable exclusions across cluster
    during “controlled” shutdown where dmg system stop is called with
    --full option but this should be regarded as experimental and not
    for use in production environments.
  • Force option is a no-op and is retained for backward compatibility
    and future use.

Allow-unstable-test: true
Features: control

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Allow-unstable-test: true
Features: control
Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr requested review from a team as code owners January 27, 2025 23:08
Copy link

Errors are Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-16312

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/1/testReport/

@@ -177,7 +177,7 @@ func (cmd *systemEraseCmd) Execute(_ []string) error {
// systemStopCmd is the struct representing the command to shutdown DAOS system.
type systemStopCmd struct {
baseRankListCmd
Force bool `long:"force" description:"Force stop DAOS system members"`
Force bool `long:"force" description:"Currently ignored"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this changed line isn't helpful. It will have to be changed back, and it doesn't really tell the admin anything useful. "Oh, it's ignored? So then force stop doesn't work? Well then, how do I forcibly stop the system?"

You can see how this change may have the opposite effect to what you intended... I think the description should be the same and the flag should just be a no-op so that everyone doesn't have to change their scripts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will revert the change

@@ -191,7 +191,8 @@ func (cmd *systemStopCmd) Execute(_ []string) (errOut error) {
if err := cmd.validateHostsRanks(); err != nil {
return err
}
req := &control.SystemStopReq{Force: cmd.Force}
// DAOS-16312: Always use force when stopping ranks.
req := &control.SystemStopReq{Force: true}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the best place to make this change. It means that only dmg users will benefit from it. Control API users will not. Better to just set it in the SystemStop RPC invoker. As an added benefit, changing it there will minimize the blast radius of this change, so that you don't have to modify the dmg tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/1/testReport/

Features: control
Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr requested a review from mjmac January 28, 2025 13:38
Comment on lines 170 to 173
signal := syscall.SIGINT
if req.Force {
signal = syscall.SIGKILL
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make all of these changes? You are (or someone else is) just going to have to change everything back later. This could have been a one-line change, maybe with some extra comments. What you could do is define a const, e.g. DefaultStopSignal, and then when things change back you only need to change it in one place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I'm not convinced it should be that simple. If we simply force all the time then we also break the call to "ds_pool_disable_exclude()" which is required for controlled shutdown as discussed here. I'm waiting for response from those that initially requested that feature and in the meantime will push both solutions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simple version that you suggest is: #15803

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is the version @gnailzenh is in favour of where prep_shutdown/disable_exclude behaviour is preserved for the non-force and no-ranks-specified dmg system stop controlled shutdown case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjmac can we go with this one as an urgent fix please?

@tanabarr
Copy link
Contributor Author

build 4 triggered at P2 with allow unstable pragma after build 3 failed NLT memcheck with unrelated issues

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/4/testReport/

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/4/testReport/

knard38
knard38 previously approved these changes Jan 29, 2025
Copy link
Contributor

@knard38 knard38 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tanabarr
Copy link
Contributor Author

Not all hardware stages started in run 4, restart from stage hardware test -> run 5.

@tanabarr
Copy link
Contributor Author

Gatekeeper please use PR title and description in commit message when landing, TIA.

@mjmac
Copy link
Contributor

mjmac commented Jan 29, 2025

@tanabarr: It looks like run #4 had a ftest failure that needs to be addressed (can't copy/paste from the remote desktop session), in the FTEST_control.ControlLogEntry test.

Test-tag: vm,ControlLogEntry
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
Copy link
Contributor

@daltonbohning daltonbohning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ftest LGTM

@tanabarr
Copy link
Contributor Author

NLT is reporting 214 new issues in bio and such.. this is concerning and i have not seen that on other PRs (not sure if other gatekeepers have).

im not say it's a problem with this PR, but NLT is not a unit test C code. it tests the entire stack. anyway all im saying is I have not seen all that many failures at once in NLT.

I'm happy to take any suggestions about what needs to be changed in this PR if it's thought it is introducing NLT failures.

@mchaarawi
Copy link
Contributor

NLT is reporting 214 new issues in bio and such.. this is concerning and i have not seen that on other PRs (not sure if other gatekeepers have).

I don't see how that has anything to do with the PR. It touches two non-test source files both in the control-plane and of go type. No C code is modified. CI run was with Features: control but I'm unsure whether that affects NLT.

it's also failing the same way on the 2.6 backport.. so TBH from my perspective it actually does look to be an issue with the PR itself.
master is working fine wrt to NLT
https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/master/10307/pipeline

@mchaarawi
Copy link
Contributor

mchaarawi commented Jan 31, 2025

NLT is reporting 214 new issues in bio and such.. this is concerning and i have not seen that on other PRs (not sure if other gatekeepers have).

im not say it's a problem with this PR, but NLT is not a unit test C code. it tests the entire stack. anyway all im saying is I have not seen all that many failures at once in NLT.

I'm happy to take any suggestions about what needs to be changed in this PR if it's thought it is introducing NLT failures.

most of the issues seem to be related to unfreed memory on the server:
https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15799/7/NLT_server/new/

@knard38
Copy link
Contributor

knard38 commented Jan 31, 2025

NLT is reporting 214 new issues in bio and such.. this is concerning and i have not seen that on other PRs (not sure if other gatekeepers have).

im not say it's a problem with this PR, but NLT is not a unit test C code. it tests the entire stack. anyway all im saying is I have not seen all that many failures at once in NLT.

I'm happy to take any suggestions about what needs to be changed in this PR if it's thought it is introducing NLT failures.

most of the issues seem to be related to unfreed memory on the server: https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15799/7/NLT_server/new/

NLT is reporting 214 new issues in bio and such.. this is concerning and i have not seen that on other PRs (not sure if other gatekeepers have).

im not say it's a problem with this PR, but NLT is not a unit test C code. it tests the entire stack. anyway all im saying is I have not seen all that many failures at once in NLT.

I'm happy to take any suggestions about what needs to be changed in this PR if it's thought it is introducing NLT failures.

most of the issues seem to be related to unfreed memory on the server: https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15799/7/NLT_server/new/

A possible explanation is that changing the kill signal (from SIGINT to SIGKILL) makes the end of the NLT tests to not be stopped properly and thus the memory to not be deallocated. This could explain the memory leak.
A workaround could be to add a graceful stop option which will kill the engine with SIGINT and to make the NLT tests to use this new option to stop the engine.

@mchaarawi
Copy link
Contributor

mchaarawi commented Jan 31, 2025

NLT is reporting 214 new issues in bio and such.. this is concerning and i have not seen that on other PRs (not sure if other gatekeepers have).

im not say it's a problem with this PR, but NLT is not a unit test C code. it tests the entire stack. anyway all im saying is I have not seen all that many failures at once in NLT.

I'm happy to take any suggestions about what needs to be changed in this PR if it's thought it is introducing NLT failures.

maybe we need to update NLT to for a controlled shutdown of the server? those memory checks are actually valuable to remove from testing

@tanabarr
Copy link
Contributor Author

NLT is reporting 214 new issues in bio and such.. this is concerning and i have not seen that on other PRs (not sure if other gatekeepers have).

im not say it's a problem with this PR, but NLT is not a unit test C code. it tests the entire stack. anyway all im saying is I have not seen all that many failures at once in NLT.

I'm happy to take any suggestions about what needs to be changed in this PR if it's thought it is introducing NLT failures.

maybe we need to update NLT to for a controlled shutdown of the server? those memory checks are actually valuable to remove from testing

To clarify you want me to reintroduce the option that I was explicitly asked to remove so we can perform the legacy (presumed buggy) behaviour to cover up failures in NLT that are related to a new change that we are introducing in production code. @mchaarawi @mjmac are you both happy with that approach? If so I will go ahead and make the changes.

The reason I'm being slightly cynical is that it looks like these leaks will be happening in production once we introduce this change to me that indicates removing SIGINT isn't the correct approach. It may be that it's the lesser of 2 evils e.g. memory leaking is better than data loss but it doesn't sound great either way so I want to make sure I'm being asked to do the right thing.

@mchaarawi
Copy link
Contributor

NLT is reporting 214 new issues in bio and such.. this is concerning and i have not seen that on other PRs (not sure if other gatekeepers have).

im not say it's a problem with this PR, but NLT is not a unit test C code. it tests the entire stack. anyway all im saying is I have not seen all that many failures at once in NLT.

I'm happy to take any suggestions about what needs to be changed in this PR if it's thought it is introducing NLT failures.

maybe we need to update NLT to for a controlled shutdown of the server? those memory checks are actually valuable to remove from testing

To clarify you want me to reintroduce the option that I was explicitly asked to remove so we can perform the legacy (presumed buggy) behaviour to cover up failures in NLT that are related to a new change that we are introducing in production code. @mchaarawi @mjmac are you both happy with that approach? If so I will go ahead and make the changes.

The reason I'm being slightly cynical is that it looks like these leaks will be happening in production once we introduce this change to me that indicates removing SIGINT isn't the correct approach. It may be that it's the lesser of 2 evils e.g. memory leaking is better than data loss but it doesn't sound great either way so I want to make sure I'm being asked to do the right thing.

does not sound cynical at all ;-)
there are 2 aspects to this IMO:

  1. the memcheck itself, yes we will see those leaked errors on a production system with valgrind and that is not my concern for this. But NLT also gives us more than that with this option. if there is a real growing memleak, we can discover that with this check (if the server was properly shutdown). So im less concerned with all the non-freed memory at shutdown and more with actual leaked memory during regular operation. I was discussing this with @jolivier23 and he suggested this change:
    DAOS-16312 test: remove leak check full on server valgrind #15821
    but TBH it sounds we would loose all the memleak checks in that case?

  2. what is the way moving forward in the future? is it sigkill? or do we plan to bring the graceful way back? if it's the latter then it would make absolute sense to add a test only option for the latter in the meantime so testers can try it out in the wild.

@mjmac
Copy link
Contributor

mjmac commented Jan 31, 2025

  1. what is the way moving forward in the future? is it sigkill? or do we plan to bring the graceful way back? if it's the latter then it would make absolute sense to add a test only option for the latter in the meantime so testers can try it out in the wild.

This is the question I have been asking... IMO there is no compelling reason to rush this change into master, given that no one is (or should be, anyhow) using master in production, and therefore there is no obligation to protect anyone from potentially incorrect shutdown behavior on that branch.

I understand and agree with the urgency for getting the change into the 2.6 branch, with the understanding that it may be a workaround there rather than a complete solution. My preference would be to hold off on making this change on master until all interested parties are back from holidays and we can really examine the pros and cons of making this big change in behavior for the next supported version.

@tanabarr
Copy link
Contributor Author

I will add the option and repush, we can decide whether to land master one on Monday.

Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr dismissed stale reviews from kjacque, knard38, and mjmac via 4b3104f January 31, 2025 21:28
@tanabarr
Copy link
Contributor Author

@mjmac does this look ok?
@daltonbohning @phender is it okay to run just with pr pragma or does it need control feature?

@@ -178,6 +178,7 @@ func (cmd *systemEraseCmd) Execute(_ []string) error {
type systemStopCmd struct {
baseRankListCmd
Force bool `long:"force" description:"Force stop DAOS system members"`
Full bool `long:"full" description:"Attempt a full shutdown of DAOS system"`
Copy link
Contributor

@mchaarawi mchaarawi Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc only: can we add some warning in the description here like do not use in production or something along that line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, what about "Experimental, not for use in production"?

Copy link
Contributor

@kjacque kjacque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a little hacky to maintain a dangerous buggy form of shutdown just to please NLT. I get that this may be the only option to maintain the memcheck stuff, but having a switch that says "do not use" seems like bad UI to me.

After reading the conversation, to be honest I feel like using an undocumented environment variable to trigger the "graceful" (actually data-unsafe) shutdown for NLT only is better than exposing this unwanted behavior to users.

@@ -178,6 +178,7 @@ func (cmd *systemEraseCmd) Execute(_ []string) error {
type systemStopCmd struct {
baseRankListCmd
Force bool `long:"force" description:"Force stop DAOS system members"`
Full bool `long:"full" description:"Attempt a full shutdown of DAOS system"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "full" name seems odd to me, but I suppose it doesn't matter very much. Maybe "graceful" would be more accurate, when it comes to maintaining the old behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with full after conversation with Liang. I think separating out specific procedure that is associated with a shutdown of the whole system which involves prepare shutdown/disable exclude warrents a separate flag which doesn't tie to any specific functionality e.g. graceful which may or may not be maintained seemed sensible.

Features: control
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
Features: control
Allow-unstable-test: true
Signed-off-by: Tom Nabarro <[email protected]>
@daosbuild1
Copy link
Collaborator

@tanabarr
Copy link
Contributor Author

tanabarr commented Feb 1, 2025

re-pushed at P2
Gatekeeper pleases PR description and title in commit message when landing. TIA

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15799/11/display/redirect

Copy link
Contributor

@knard38 knard38 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kjacque
Copy link
Contributor

kjacque commented Feb 3, 2025

If it's possible to put off landing this to master, I think it would be better to try to refine the behavior to something more permanent, rather than quickly getting a kludge in. I feel like it's unfortunately likely people will try to use the "do not use" button.

@mchaarawi
Copy link
Contributor

If it's possible to put off landing this to master, I think it would be better to try to refine the behavior to something more permanent, rather than quickly getting a kludge in. I feel like it's unfortunately likely people will try to use the "do not use" button.

im ok with putting off landing to master, but i should say your comment applies more to landing to 2.6 ;-)
no one is using master outside of developers.

i will argue that it's always best to keep the branches in sync as much as possible, but does not always have to be the case.

@kjacque
Copy link
Contributor

kjacque commented Feb 3, 2025

im ok with putting off landing to master, but i should say your comment applies more to landing to 2.6 ;-) no one is using master outside of developers.

Well, changing the default behavior is an emergency for 2.6.3, and this is one way to do that while preserving NLT behavior. But in master we have an opportunity to decide what the correct behavior is for the near future, and potentially make a different choice.

If there are no plans to fix the graceful shutdown imminently, we should use something invisible to users to enable it for testing, and/or do something else to fix the memcheck portion of NLT. If we want to fix the graceful shutdown, and make it the default again for 2.8, we probably don't actually want to change the default behavior on master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
control-plane work on the management infrastructure of the DAOS Control Plane forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. usability Changes specific to user facing tools or behaviour.
Development

Successfully merging this pull request may close these issues.

7 participants