DAOS-16477 mgmt: return suspect engines for pool healthy query (#15458) #15578
Ticket title is 'Provide admin interface to query hanging engines after massive failure'
eb3da78 to 76b7a6a
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15578/3/testReport/
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15578/4/execution/node/490/log
76b7a6a to 051e1de
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15578/5/execution/node/1210/log
* DAOS-16477 mgmt: return suspect engines for pool healthy query

After significant failures, the system may leave behind some suspect engines that were marked as DEAD by the SWIM protocol, but were not excluded from the system to prevent data loss. An administrator can bring these ranks back online by restarting them.

This PR aims to provide an administrative interface for querying suspect engines following a massive failure. These suspect engines can be retrieved using the daos/dmg --health-only command.

An example of output of dmg pool query --health-only:

Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:
- Disabled ranks: 1
- Suspect ranks: 2
- Rebuild busy, 0 objs, 0 recs

Features: DmgPoolQueryRanks

Signed-off-by: Wang Shilong <[email protected]>
Signed-off-by: Phil Henderson <[email protected]>
Co-authored-by: Phil Henderson <[email protected]>
051e1de to 52b2dba
* DAOS-16702 rebuild: restart rebuild for a massive failure case

In a special massive failure case:
1. Some engines go down and trigger a rebuild.
2. One engine that is participating in the rebuild also goes down before the rebuild finishes; the number of failures now exceeds the pool RF, so the pool map is not changed.
3. The administrator restarts that engine.

In this case the rebuild task should be recovered on that engine; to keep things simple, for now the global rebuild task is just aborted and retried. The typical recovery approach of restarting the whole system, including the PS leader, does not have this issue.

Signed-off-by: Xuezhao Liu <[email protected]>
Required-githooks: true
Change-Id: Ifd3f793661ea9f64aa47162a791b17b4987164ba
Signed-off-by: Jeff Olivier <[email protected]>
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15578/6/display/redirect
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15578/6/display/redirect
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15578/6/display/redirect
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15578/6/display/redirect
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15578/7/display/redirect
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15578/9/testReport/
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15578/10/testReport/
After significant failures, the system may leave behind some suspect engines that were marked as DEAD by the SWIM protocol, but were not excluded from the system to prevent data loss. An administrator can bring these ranks back online by restarting them.
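The definition above (marked DEAD by SWIM but not excluded) amounts to a set difference. A minimal Python sketch of that idea, for illustration only: the function name and the plain sets of rank numbers are hypothetical and are not taken from the DAOS code base.

```python
def find_suspect_ranks(swim_dead, excluded):
    """Return ranks that SWIM marked DEAD but that were not excluded.

    Both arguments are iterables of integer rank numbers. This is a
    hypothetical helper for illustration, not a DAOS API.
    """
    return sorted(set(swim_dead) - set(excluded))


# Ranks 1 and 2 were marked DEAD, but only rank 1 was excluded,
# so rank 2 is left as a suspect that an administrator could restart.
suspects = find_suspect_ranks(swim_dead={1, 2}, excluded={1})
```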
This PR provides an administrative interface for querying suspect engines after a massive failure. The suspect engines can be retrieved with the --health-only option of the daos/dmg pool query command.
Example output of dmg pool query --health-only:
Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:
- Disabled ranks: 1
- Suspect ranks: 2
- Rebuild busy, 0 objs, 0 recs
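For scripting around this output, a rough Python sketch of pulling the pool state and the disabled and suspect ranks out of the text form is shown here. It assumes the line layout of the sample quoted in the commit message (the line breaks in SAMPLE are a reconstruction), and the handling of comma-separated lists and a-b ranges is an assumption, since the sample only shows single ranks. The function names are hypothetical.

```python
import re

# Sample text from the commit message; line breaks are reconstructed.
SAMPLE = """\
Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:
- Disabled ranks: 1
- Suspect ranks: 2
- Rebuild busy, 0 objs, 0 recs
"""


def expand_ranks(text):
    """Expand a rank list such as '1,3-5' into [1, 3, 4, 5].

    Assumption: lists may use commas and a-b ranges, optionally in brackets.
    """
    ranks = []
    for part in text.strip().strip("[]").split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            ranks.extend(range(int(lo), int(hi) + 1))
        else:
            ranks.append(int(part))
    return ranks


def parse_health(output):
    """Extract the pool state plus disabled and suspect ranks from the text."""
    result = {"state": None, "disabled": [], "suspect": []}
    state = re.search(r"state=(\w+)", output)
    if state:
        result["state"] = state.group(1)
    for line in output.splitlines():
        m = re.match(r"\s*-\s*(Disabled|Suspect) ranks:\s*(.*)$", line)
        if m:
            result[m.group(1).lower()] = expand_ranks(m.group(2))
    return result


health = parse_health(SAMPLE)
```

Parsing human-readable text is fragile; if the tool offers a machine-readable output mode, that would be the safer input for automation.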
Features: DmgPoolQueryRanks
skip-nlt: true
Required-githooks: true
Before requesting gatekeeper:
- Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: