DAOS-16312 control: Always use --force for dmg system stop #15799
base: master
Allow-unstable-test: true Features: control Signed-off-by: Tom Nabarro <[email protected]>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/1/testReport/
src/control/cmd/dmg/system.go
Outdated
@@ -177,7 +177,7 @@ func (cmd *systemEraseCmd) Execute(_ []string) error {
 // systemStopCmd is the struct representing the command to shutdown DAOS system.
 type systemStopCmd struct {
 	baseRankListCmd
-	Force bool `long:"force" description:"Force stop DAOS system members"`
+	Force bool `long:"force" description:"Currently ignored"`
IMO this changed line isn't helpful. It will have to be changed back, and it doesn't really tell the admin anything useful. "Oh, it's ignored? So then force stop doesn't work? Well then, how do I forcibly stop the system?"
You can see how this change may have the opposite effect to what you intended... I think the description should be the same and the flag should just be a no-op so that everyone doesn't have to change their scripts.
Ok, I will revert the change
src/control/cmd/dmg/system.go
Outdated
@@ -191,7 +191,8 @@ func (cmd *systemStopCmd) Execute(_ []string) (errOut error) {
 	if err := cmd.validateHostsRanks(); err != nil {
 		return err
 	}
-	req := &control.SystemStopReq{Force: cmd.Force}
+	// DAOS-16312: Always use force when stopping ranks.
+	req := &control.SystemStopReq{Force: true}
This is not the best place to make this change. It means that only dmg users will benefit from it. Control API users will not. Better to just set it in the SystemStop RPC invoker. As an added benefit, changing it there will minimize the blast radius of this change, so that you don't have to modify the dmg tests.
ok, done
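To illustrate the review suggestion above, here is a minimal sketch of overriding the flag at the API layer instead of in the CLI. `SystemStopReq` and `invokeSystemStop` below are hypothetical stand-ins, not the real control API types or the actual change; the point is only that an override placed in the invoker applies to every caller.

```go
package main

import "fmt"

// SystemStopReq is a hypothetical stand-in for the control API request type.
type SystemStopReq struct {
	Force bool
	Ranks []uint32
}

// invokeSystemStop is a hypothetical stand-in for the SystemStop RPC invoker.
// Setting Force here covers dmg and any other control API caller alike.
func invokeSystemStop(req *SystemStopReq) SystemStopReq {
	sent := *req      // copy so the caller's request is left untouched
	sent.Force = true // DAOS-16312: always force when stopping ranks
	return sent
}

func main() {
	// A dmg-style caller and a direct API caller both leave Force unset.
	fromDmg := invokeSystemStop(&SystemStopReq{Force: false})
	fromAPI := invokeSystemStop(&SystemStopReq{Ranks: []uint32{0, 1}})
	fmt.Println(fromDmg.Force, fromAPI.Force)
}
```

With the override in one place, the CLI can keep passing `cmd.Force` through unchanged, which is why the dmg tests would not need to be modified.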
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/1/testReport/
Features: control Signed-off-by: Tom Nabarro <[email protected]>
src/control/server/ctl_ranks_rpc.go
Outdated
	signal := syscall.SIGINT
	if req.Force {
		signal = syscall.SIGKILL
	}
Why make all of these changes? You are (or someone else is) just going to have to change everything back later. This could have been a one-line change, maybe with some extra comments. What you could do is define a const, e.g. DefaultStopSignal, and then when things change back you only need to change it in one place.
Because I'm not convinced it should be that simple. If we simply force all the time then we also break the call to "ds_pool_disable_exclude()", which is required for controlled shutdown as discussed here. I'm waiting for a response from those who initially requested that feature, and in the meantime I will push both solutions.
The simple version that you suggest is: #15803
Looks like this is the version @gnailzenh is in favour of, where the prep_shutdown/disable_exclude behaviour is preserved for the controlled shutdown case: dmg system stop with no force flag and no ranks specified.
@mjmac can we go with this one as an urgent fix please?
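For reference, the `DefaultStopSignal` constant suggested earlier in this thread could look roughly like the sketch below. Only the constant name comes from the review comment; the `stopSignal` helper is hypothetical and is not the code in ctl_ranks_rpc.go.

```go
package main

import (
	"fmt"
	"syscall"
)

// DefaultStopSignal is the one place to edit if the default ever goes back
// to a gentler signal such as syscall.SIGINT.
const DefaultStopSignal = syscall.SIGKILL

// stopSignal is a hypothetical helper: a forced stop always uses SIGKILL,
// anything else uses the configured default.
func stopSignal(force bool) syscall.Signal {
	if force {
		return syscall.SIGKILL
	}
	return DefaultStopSignal
}

func main() {
	fmt.Println(stopSignal(false) == DefaultStopSignal)
	fmt.Println(stopSignal(true) == syscall.SIGKILL)
}
```

The trade-off raised in the reply still applies: a constant like this only changes the signal and does nothing about the prep_shutdown/disable_exclude steps.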
build 4 triggered at P2 with allow unstable pragma after build 3 failed NLT memcheck with unrelated issues
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/4/testReport/
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15799/4/testReport/
LGTM
Not all hardware stages started in run 4, restart from stage hardware test -> run 5.
Gatekeeper please use PR title and description in commit message when landing, TIA.
Test-tag: vm,ControlLogEntry Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>
ftest LGTM
I'm happy to take any suggestions about what needs to be changed in this PR if it's thought to be introducing NLT failures.
it's also failing the same way on the 2.6 backport, so TBH from my perspective it actually does look to be an issue with the PR itself.
most of the issues seem to be related to unfreed memory on the server:
A possible explanation is that changing the kill signal (from SIGINT to SIGKILL) means the server is not stopped properly at the end of the NLT tests, so its memory is never deallocated. This could explain the memory leak.
maybe we need to update NLT to force a controlled shutdown of the server? those memory checks are actually too valuable to remove from testing
To clarify: you want me to reintroduce the option that I was explicitly asked to remove, so we can perform the legacy (presumed buggy) behaviour to cover up failures in NLT that are related to a new change we are introducing in production code? @mchaarawi @mjmac are you both happy with that approach? If so I will go ahead and make the changes. The reason I'm being slightly cynical is that it looks like these leaks will be happening in production once we introduce this change. To me, that indicates removing SIGINT isn't the correct approach. It may be the lesser of two evils (e.g. memory leaking is better than data loss), but it doesn't sound great either way, so I want to make sure I'm being asked to do the right thing.
does not sound cynical at all ;-)
This is the question I have been asking... IMO there is no compelling reason to rush this change into master, given that no one is (or should be, anyhow) using master in production, and therefore there is no obligation to protect anyone from potentially incorrect shutdown behavior on that branch. I understand and agree with the urgency for getting the change into the 2.6 branch, with the understanding that it may be a workaround there rather than a complete solution. My preference would be to hold off on making this change on master until all interested parties are back from holidays and we can really examine the pros and cons of making this big change in behavior for the next supported version.
I will add the option and repush; we can decide on Monday whether to land the master one.
Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>
@mjmac does this look ok?
src/control/cmd/dmg/system.go
Outdated
@@ -178,6 +178,7 @@ func (cmd *systemEraseCmd) Execute(_ []string) error {
 type systemStopCmd struct {
 	baseRankListCmd
 	Force bool `long:"force" description:"Force stop DAOS system members"`
+	Full bool `long:"full" description:"Attempt a full shutdown of DAOS system"`
doc only: can we add some warning in the description here like do not use in production or something along that line?
Sure, what about "Experimental, not for use in production"?
…op-sigkill Signed-off-by: Tom Nabarro <[email protected]>
I find it a little hacky to maintain a dangerous buggy form of shutdown just to please NLT. I get that this may be the only option to maintain the memcheck stuff, but having a switch that says "do not use" seems like bad UI to me.
After reading the conversation, to be honest I feel like using an undocumented environment variable to trigger the "graceful" (actually data-unsafe) shutdown for NLT only is better than exposing this unwanted behavior to users.
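As a rough sketch of what the environment-variable idea in this comment might look like: the variable name `EXAMPLE_TEST_GRACEFUL_STOP` and the helper `stopSignalFrom` are invented purely for illustration and are not real DAOS settings or functions. The lookup function is passed in so the behaviour can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// gracefulStopEnv is a hypothetical, undocumented, test-only variable name.
const gracefulStopEnv = "EXAMPLE_TEST_GRACEFUL_STOP"

const (
	forcedSignal   = syscall.SIGKILL // default for all users
	gracefulSignal = syscall.SIGINT  // legacy behaviour, test harness only
)

// stopSignalFrom picks the stop signal using the supplied lookup function
// (os.Getenv in normal use). Only the exact value "1" enables the legacy path.
func stopSignalFrom(getenv func(string) string) syscall.Signal {
	if getenv(gracefulStopEnv) == "1" {
		return gracefulSignal
	}
	return forcedSignal
}

func main() {
	fmt.Println("stop signal:", stopSignalFrom(os.Getenv))
}
```

Compared with a user-visible flag, nothing appears in the dmg help output, so the legacy path stays out of reach unless a test harness sets the variable explicitly.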
src/control/cmd/dmg/system.go
Outdated
@@ -178,6 +178,7 @@ func (cmd *systemEraseCmd) Execute(_ []string) error {
 type systemStopCmd struct {
 	baseRankListCmd
 	Force bool `long:"force" description:"Force stop DAOS system members"`
+	Full bool `long:"full" description:"Attempt a full shutdown of DAOS system"`
The "full" name seems odd to me, but I suppose it doesn't matter very much. Maybe "graceful" would be more accurate, when it comes to maintaining the old behavior?
I went with full after a conversation with Liang. The specific procedure associated with shutting down the whole system, which involves prepare shutdown/disable exclude, warrants a separate flag. It seemed sensible to pick a name that doesn't tie it to any specific functionality, e.g. graceful, which may or may not be maintained.
Features: control Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>
Features: control Allow-unstable-test: true Signed-off-by: Tom Nabarro <[email protected]>
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15799/9/execution/node/797/log
re-pushed at P2
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15799/11/display/redirect
LGTM
If it's possible to put off landing this to master, I think it would be better to try to refine the behavior to something more permanent, rather than quickly getting a kludge in. I feel like it's unfortunately likely people will try to use the "do not use" button.
I'm ok with putting off landing to master, but I should say your comment applies more to landing to 2.6 ;-) I will argue that it's always best to keep the branches in sync as much as possible, but that does not always have to be the case.
Well, changing the default behavior is an emergency for 2.6.3, and this is one way to do that while preserving NLT behavior. But in master we have an opportunity to decide what the correct behavior is for the near future, and potentially make a different choice. If there are no plans to fix the graceful shutdown imminently, we should use something invisible to users to enable it for testing, and/or do something else to fix the memcheck portion of NLT. If we want to fix the graceful shutdown, and make it the default again for 2.8, we probably don't actually want to change the default behavior on master.
Whenever stopping an engine process from within the control-plane, use
SIGKILL rather than asking nicely (SIGTERM). This has been requested
to try to avoid situations that could result in data loss.
This change preserves the behaviour where ds_mgmt_drpc_prep_shutdown()
and then ds_pool_disable_exclude() will be called during a controlled
shutdown where dmg system stop is called with the new --full argument.
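The ordering described above can be sketched as a simple step list. `stopSteps` is a hypothetical helper that only models which steps run and in what order. It does not call any DAOS code, and it deliberately leaves out which signal is used for the final stop.

```go
package main

import "fmt"

// stopSteps returns the ordered steps for a stop request in this sketch.
// The two preparation steps run only for a controlled (--full) shutdown.
func stopSteps(full bool) []string {
	var steps []string
	if full {
		steps = append(steps,
			"ds_mgmt_drpc_prep_shutdown()",
			"ds_pool_disable_exclude()",
		)
	}
	return append(steps, "stop engine processes")
}

func main() {
	fmt.Println(stopSteps(false))
	fmt.Println(stopSteps(true))
}
```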
Notable behavior changes with this PR:
option is supplied.
during “controlled” shutdown where dmg system stop is called with
--full option but this should be regarded as experimental and not
for use in production environments.
and future use.
Allow-unstable-test: true
Features: control