DAOS-16312 control: Always use --force for dmg system stop #15799

tanabarr · 2025-01-27T23:08:12Z

Whenever stopping an engine process from within the control-plane, use
SIGKILL rather than asking nicely (SIGTERM). This has been requested
to try to avoid situations that could result in dataloss.

This change preserves the behaviour where ds_mgmt_drpc_prep_shutdown()
and then ds_pool_disable_exclude() will be called during a controlled
shutdown where dmg system stop is called with new --full argument.

Notable behavior changes with this PR:

Always performs SIGKILL on dmg system stop unless --full command
option is supplied.
Will attempt prepare shutdown to disable exclusions across cluster
during “controlled” shutdown where dmg system stop is called with
--full option but this should be regarded as experimental and not
for use in production environments.
Force option is a no-op and is retained for backward compatibility
and future use.

Allow-unstable-test: true
Features: control

Before requesting gatekeeper:

Two review approvals and any prior change requests have been resolved.
Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Commit messages follows the guidelines outlined here.
Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

Allow-unstable-test: true Features: control Signed-off-by: Tom Nabarro <[email protected]>

github-actions · 2025-01-27T23:09:52Z

Errors are Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-16312

daosbuild1 · 2025-01-27T23:55:14Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/1/testReport/

mjmac · 2025-01-28T02:00:19Z

src/control/cmd/dmg/system.go

@@ -177,7 +177,7 @@ func (cmd *systemEraseCmd) Execute(_ []string) error {
 // systemStopCmd is the struct representing the command to shutdown DAOS system.
 type systemStopCmd struct {
 	baseRankListCmd
-	Force bool `long:"force" description:"Force stop DAOS system members"`
+	Force bool `long:"force" description:"Currently ignored"`


IMO this changed line isn't helpful. It will have to be changed back, and it doesn't really tell the admin anything useful. "Oh, it's ignored? So then force stop doesn't work? Well then, how do I forcibly stop the system?"

You can see how this change may have the opposite effect to what you intended... I think the description should be the same and the flag should just be a no-op so that everyone doesn't have to change their scripts.

Ok, I will revert the change

mjmac · 2025-01-28T02:04:57Z

src/control/cmd/dmg/system.go

@@ -191,7 +191,8 @@ func (cmd *systemStopCmd) Execute(_ []string) (errOut error) {
 	if err := cmd.validateHostsRanks(); err != nil {
 		return err
 	}
-	req := &control.SystemStopReq{Force: cmd.Force}
+	// DAOS-16312: Always use force when stopping ranks.
+	req := &control.SystemStopReq{Force: true}


This is not the best place to make this change. It means that only dmg users will benefit from it. Control API users will not. Better to just set it in the SystemStop RPC invoker. As an added benefit, changing it there will minimize the blast radius of this change, so that you don't have to modify the dmg tests.

daosbuild1 · 2025-01-28T12:29:41Z

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/1/testReport/

Features: control Signed-off-by: Tom Nabarro <[email protected]>

mjmac · 2025-01-28T15:59:28Z

src/control/server/ctl_ranks_rpc.go

-	signal := syscall.SIGINT
-	if req.Force {
-		signal = syscall.SIGKILL
-	}


Why make all of these changes? You are (or someone else is) just going to have to change everything back later. This could have been a one-line change, maybe with some extra comments. What you could do is define a const, e.g. DefaultStopSignal, and then when things change back you only need to change it in one place.

Because I'm not convinced it should be that simple. If we simply force all the time then we also break the call to "ds_pool_disable_exclude()" which is required for controlled shutdown as discussed here. I'm waiting for response from those that initially requested that feature and in the meantime will push both solutions.

The simple version that you suggest is: #15803

Looks like this is the version @gnailzenh is in favour of where prep_shutdown/disable_exclude behaviour is preserved for the non-force and no-ranks-specified dmg system stop controlled shutdown case.

@mjmac can we go with this one as an urgent fix please?

tanabarr · 2025-01-28T17:43:20Z

build 4 triggered at P2 with allow unstable pragma after build 3 failed NLT memcheck with unrelated issues

daosbuild1 · 2025-01-28T18:24:29Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/4/testReport/

daosbuild1 · 2025-01-28T23:58:52Z

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/4/testReport/

knard38

LGTM

tanabarr · 2025-01-29T11:21:38Z

Not all hardware stages started in run 4, restart from stage hardware test -> run 5.

tanabarr · 2025-01-29T12:37:41Z

Gatekeeper please use PR title and description in commit message when landing, TIA.

mjmac · 2025-01-29T13:21:13Z

@tanabarr: It looks like run #4 had a ftest failure that needs to be addressed (can't copy/paste from the remote desktop session), in the FTEST_control.ControlLogEntry test.

Test-tag: vm,ControlLogEntry Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>

daltonbohning

ftest LGTM

tanabarr · 2025-01-31T14:48:04Z

I'm happy to take any suggestions about what needs to be changed in this PR if it's thought it is introducing NLT failures.

mchaarawi · 2025-01-31T14:56:49Z

it's also failing the same way on the 2.6 backport.. so TBH from my perspective it actually does look to be an issue with the PR itself.
master is working fine wrt to NLT
https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/master/10307/pipeline

mchaarawi · 2025-01-31T15:00:44Z

most of the issues seem to be related to unfreed memory on the server:
https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15799/7/NLT_server/new/

knard38 · 2025-01-31T15:07:32Z

A possible explanation is that changing the kill signal (from SIGINT to SIGKILL) makes the end of the NLT tests to not be stopped properly and thus the memory to not be deallocated. This could explain the memory leak.
A workaround could be to add a graceful stop option which will kill the engine with SIGINT and to make the NLT tests to use this new option to stop the engine.

mchaarawi · 2025-01-31T15:08:13Z

maybe we need to update NLT to for a controlled shutdown of the server? those memory checks are actually valuable to remove from testing

tanabarr · 2025-01-31T16:33:25Z

To clarify you want me to reintroduce the option that I was explicitly asked to remove so we can perform the legacy (presumed buggy) behaviour to cover up failures in NLT that are related to a new change that we are introducing in production code. @mchaarawi @mjmac are you both happy with that approach? If so I will go ahead and make the changes.

The reason I'm being slightly cynical is that it looks like these leaks will be happening in production once we introduce this change to me that indicates removing SIGINT isn't the correct approach. It may be that it's the lesser of 2 evils e.g. memory leaking is better than data loss but it doesn't sound great either way so I want to make sure I'm being asked to do the right thing.

mchaarawi · 2025-01-31T17:01:44Z

does not sound cynical at all ;-)
there are 2 aspects to this IMO:

the memcheck itself, yes we will see those leaked errors on a production system with valgrind and that is not my concern for this. But NLT also gives us more than that with this option. if there is a real growing memleak, we can discover that with this check (if the server was properly shutdown). So im less concerned with all the non-freed memory at shutdown and more with actual leaked memory during regular operation. I was discussing this with @jolivier23 and he suggested this change:
DAOS-16312 test: remove leak check full on server valgrind #15821
but TBH it sounds we would loose all the memleak checks in that case?
what is the way moving forward in the future? is it sigkill? or do we plan to bring the graceful way back? if it's the latter then it would make absolute sense to add a test only option for the latter in the meantime so testers can try it out in the wild.

mjmac · 2025-01-31T17:13:52Z

This is the question I have been asking... IMO there is no compelling reason to rush this change into master, given that no one is (or should be, anyhow) using master in production, and therefore there is no obligation to protect anyone from potentially incorrect shutdown behavior on that branch.

I understand and agree with the urgency for getting the change into the 2.6 branch, with the understanding that it may be a workaround there rather than a complete solution. My preference would be to hold off on making this change on master until all interested parties are back from holidays and we can really examine the pros and cons of making this big change in behavior for the next supported version.

tanabarr · 2025-01-31T17:35:53Z

I will add the option and repush, we can decide whether to land master one on Monday.

Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>

tanabarr · 2025-01-31T21:30:20Z

@mjmac does this look ok?
@daltonbohning @phender is it okay to run just with pr pragma or does it need control feature?

mchaarawi · 2025-01-31T21:55:09Z

src/control/cmd/dmg/system.go

@@ -178,6 +178,7 @@ func (cmd *systemEraseCmd) Execute(_ []string) error {
 type systemStopCmd struct {
 	baseRankListCmd
 	Force bool `long:"force" description:"Force stop DAOS system members"`
+	Full  bool `long:"full" description:"Attempt a full shutdown of DAOS system"`


doc only: can we add some warning in the description here like do not use in production or something along that line?

Sure, what about "Experimental, not for use in production"?

…op-sigkill Signed-off-by: Tom Nabarro <[email protected]>

kjacque

I find it a little hacky to maintain a dangerous buggy form of shutdown just to please NLT. I get that this may be the only option to maintain the memcheck stuff, but having a switch that says "do not use" seems like bad UI to me.

After reading the conversation, to be honest I feel like using an undocumented environment variable to trigger the "graceful" (actually data-unsafe) shutdown for NLT only is better than exposing this unwanted behavior to users.

kjacque · 2025-02-01T01:50:13Z

src/control/cmd/dmg/system.go

@@ -178,6 +178,7 @@ func (cmd *systemEraseCmd) Execute(_ []string) error {
 type systemStopCmd struct {
 	baseRankListCmd
 	Force bool `long:"force" description:"Force stop DAOS system members"`
+	Full  bool `long:"full" description:"Attempt a full shutdown of DAOS system"`


The "full" name seems odd to me, but I suppose it doesn't matter very much. Maybe "graceful" would be more accurate, when it comes to maintaining the old behavior?

I went with full after conversation with Liang. I think separating out specific procedure that is associated with a shutdown of the whole system which involves prepare shutdown/disable exclude warrents a separate flag which doesn't tie to any specific functionality e.g. graceful which may or may not be maintained seemed sensible.

Features: control Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>

daosbuild1 · 2025-02-01T10:04:31Z

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15799/9/execution/node/797/log

tanabarr · 2025-02-01T10:31:20Z

re-pushed at P2
Gatekeeper pleases PR description and title in commit message when landing. TIA

daosbuild1 · 2025-02-02T04:48:22Z

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15799/11/display/redirect

knard38

LGTM

kjacque · 2025-02-03T16:18:19Z

If it's possible to put off landing this to master, I think it would be better to try to refine the behavior to something more permanent, rather than quickly getting a kludge in. I feel like it's unfortunately likely people will try to use the "do not use" button.

mchaarawi · 2025-02-03T16:46:35Z

im ok with putting off landing to master, but i should say your comment applies more to landing to 2.6 ;-)
no one is using master outside of developers.

i will argue that it's always best to keep the branches in sync as much as possible, but does not always have to be the case.

kjacque · 2025-02-03T18:22:37Z

Well, changing the default behavior is an emergency for 2.6.3, and this is one way to do that while preserving NLT behavior. But in master we have an opportunity to decide what the correct behavior is for the near future, and potentially make a different choice.

If there are no plans to fix the graceful shutdown imminently, we should use something invisible to users to enable it for testing, and/or do something else to fix the memcheck portion of NLT. If we want to fix the graceful shutdown, and make it the default again for 2.8, we probably don't actually want to change the default behavior on master.

DAOS-16312 control: Always use --force for dmg system stop

8543255

Allow-unstable-test: true Features: control Signed-off-by: Tom Nabarro <[email protected]>

tanabarr requested review from a team as code owners January 27, 2025 23:08

mjmac reviewed Jan 28, 2025

View reviewed changes

use sigkill only but suppress exclude unless forced

061d971

Features: control Signed-off-by: Tom Nabarro <[email protected]>

tanabarr requested a review from mjmac January 28, 2025 13:38

mjmac reviewed Jan 28, 2025

View reviewed changes

tanabarr requested review from mjmac, liw, johannlombardi, gnailzenh, kjacque and knard38 January 28, 2025 17:30

knard38 previously approved these changes Jan 29, 2025

View reviewed changes

fix failure in log_entrye.py ftest

9a49883

Test-tag: vm,ControlLogEntry Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>

tanabarr dismissed knard38’s stale review via 9a49883 January 29, 2025 17:03

tanabarr requested review from a team as code owners January 29, 2025 17:03

daltonbohning reviewed Jan 29, 2025

View reviewed changes

tanabarr requested review from knard38 and daltonbohning January 29, 2025 18:03

add full option to provide legacy behavior

4b3104f

Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>

tanabarr dismissed stale reviews from kjacque, knard38, and mjmac via 4b3104f January 31, 2025 21:28

mchaarawi reviewed Jan 31, 2025

View reviewed changes

Merge remote-tracking branch 'origin/master' into tanabarr/control-st…

50caefa

…op-sigkill Signed-off-by: Tom Nabarro <[email protected]>

kjacque reviewed Feb 1, 2025

View reviewed changes

tanabarr added 2 commits February 1, 2025 09:25

improve unit test coverage and use legacy behaviour in nlt

df7a4fb

Features: control Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>

add missing comma

18302f7

Features: control Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>

knard38 approved these changes Feb 3, 2025

View reviewed changes

tanabarr requested review from kjacque, mchaarawi and mjmac February 3, 2025 10:08

tanabarr mentioned this pull request Feb 3, 2025

DAOS-16312 control: Always use --force for dmg system stop #15803

Closed

18 tasks

DAOS-16312 control: Always use --force for dmg system stop #15799

Are you sure you want to change the base?

DAOS-16312 control: Always use --force for dmg system stop #15799

Conversation

tanabarr commented Jan 27, 2025 • edited Loading

Before requesting gatekeeper:

Gatekeeper:

github-actions bot commented Jan 27, 2025

daosbuild1 commented Jan 27, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

daosbuild1 commented Jan 28, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tanabarr commented Jan 28, 2025

daosbuild1 commented Jan 28, 2025

daosbuild1 commented Jan 28, 2025

knard38 left a comment

Choose a reason for hiding this comment

tanabarr commented Jan 29, 2025

tanabarr commented Jan 29, 2025

mjmac commented Jan 29, 2025

daltonbohning left a comment

Choose a reason for hiding this comment

tanabarr commented Jan 31, 2025

mchaarawi commented Jan 31, 2025

mchaarawi commented Jan 31, 2025 • edited Loading

knard38 commented Jan 31, 2025 • edited Loading

mchaarawi commented Jan 31, 2025 • edited Loading

tanabarr commented Jan 31, 2025

mchaarawi commented Jan 31, 2025

mjmac commented Jan 31, 2025

tanabarr commented Jan 31, 2025

tanabarr commented Jan 31, 2025

mchaarawi Jan 31, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kjacque left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

daosbuild1 commented Feb 1, 2025

tanabarr commented Feb 1, 2025

daosbuild1 commented Feb 2, 2025

knard38 left a comment

Choose a reason for hiding this comment

kjacque commented Feb 3, 2025

mchaarawi commented Feb 3, 2025

kjacque commented Feb 3, 2025

tanabarr commented Jan 27, 2025 •

edited

Loading

mchaarawi commented Jan 31, 2025 •

edited

Loading

knard38 commented Jan 31, 2025 •

edited

Loading

mchaarawi commented Jan 31, 2025 •

edited

Loading

mchaarawi Jan 31, 2025 •

edited

Loading