DAOS-11955 pool: Ensure a PS is inside pool (#13046)
* DAOS-11955 pool: Ensure a PS is inside its pool

It was found that a PS leader may enter ds_pool_plan_svc_reconfs with
itself being an undesirable replica. This may lead to an assertion
failure at "move n replicas from undesired to to_remove" in
ds_pool_plan_svc_reconfs. Moreover, such a PS leader may be outside of
the pool group, making it incapable of performing many duties that
involve collective communication.

This patch therefore ensures that a PS leader will remove undesirable PS
replicas synchronously before committing a pool map modification that
introduces new undesirable PS replicas. (If we were to keep an
undesirable PS replica, it might become a PS leader.)

  - Extend and clean up pool_svc_sched.
      * Allow pool_svc_reconf_ult to return an error, so that we can
        fail a pool map modification if its synchronous PS replica
        removal fails.
      * Allow pool_svc_reconf_ult to get an argument, so that we can
        tell pool_svc_reconf_ult whether we want a synchronous
        remove-only run or an asynchronous add-remove run.
      * Move pool_svc_sched.{psc_svc_rf,psc_force_notify} up to
        pool_svc.
  - Prevent pool_svc_step_up_cb from canceling in-progress
    reconfigurations by comparing pool map versions for which the
    reconfigurations are scheduled.
  - Rename POOL_GROUP_MAP_STATUS to POOL_GROUP_MAP_STATES so that we are
    consistent with the pool_map module.

Signed-off-by: Li Wei <[email protected]>
Signed-off-by: Jeff Olivier <[email protected]>
liw authored and jolivier23 committed May 27, 2024
1 parent 868bf18 commit ef5e51c
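
The ordering described in the commit message (remove undesirable PS replicas synchronously, then commit the pool map modification only if that removal succeeded) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the DAOS implementation: `struct ex_svc`, `ex_remove_replicas`, `ex_map_update`, and the `fail_removal` hook are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EX_MAX_RANKS 8

/* Hypothetical, simplified pool service state (not the DAOS structures). */
struct ex_svc {
	bool     in_pool[EX_MAX_RANKS]; /* rank belongs to the pool group */
	bool     replica[EX_MAX_RANKS]; /* rank hosts a PS replica */
	bool     fail_removal;          /* illustration hook: force removal to fail */
	unsigned map_version;
};

/*
 * Synchronous, remove-only step: drop replicas on ranks that the proposed
 * map would place outside the pool group. Returns 0 on success.
 */
static int
ex_remove_replicas(struct ex_svc *svc, const bool *new_in_pool)
{
	size_t i;

	if (svc->fail_removal)
		return -1;
	for (i = 0; i < EX_MAX_RANKS; i++)
		if (svc->replica[i] && !new_in_pool[i])
			svc->replica[i] = false;
	return 0;
}

/* Apply a map modification; fail it if the replica removal fails. */
static int
ex_map_update(struct ex_svc *svc, const bool *new_in_pool)
{
	size_t i;
	int    rc;

	assert(svc != NULL && new_in_pool != NULL);

	rc = ex_remove_replicas(svc, new_in_pool);
	if (rc != 0)
		return rc; /* the map and its version are left untouched */

	for (i = 0; i < EX_MAX_RANKS; i++)
		svc->in_pool[i] = new_in_pool[i];
	svc->map_version++;
	return 0;
}
```

In this sketch a failed removal leaves both the map and the replica set unchanged, so no replica can end up outside the pool group as a side effect of a committed map change.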
Showing 4 changed files with 338 additions and 179 deletions.
13 changes: 11 additions & 2 deletions src/pool/srv_internal.h
@@ -16,8 +16,17 @@
 #include <daos_security.h>
 #include <gurt/telemetry_common.h>
 
-/* Map status of ranks that make up the pool group */
-#define POOL_GROUP_MAP_STATUS (PO_COMP_ST_UP | PO_COMP_ST_UPIN | PO_COMP_ST_DRAIN)
+/* Map states of ranks that make up the pool group */
+#define POOL_GROUP_MAP_STATES (PO_COMP_ST_UP | PO_COMP_ST_UPIN | PO_COMP_ST_DRAIN)
+
+/* Map states of ranks that make up the pool service */
+#define POOL_SVC_MAP_STATES (PO_COMP_ST_UP | PO_COMP_ST_UPIN)
+
+/*
+ * Since we want all PS replicas to belong to the pool group,
+ * POOL_SVC_MAP_STATES must be a subset of POOL_GROUP_MAP_STATES.
+ */
+D_CASSERT((POOL_SVC_MAP_STATES & POOL_GROUP_MAP_STATES) == POOL_SVC_MAP_STATES);
 
 /**
  * Global pool metrics
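
The subset check in the D_CASSERT above relies on a standard bit-mask identity: a mask `a` is a subset of `b` exactly when `a & b == a`. A minimal standalone sketch of that idea follows. The `EX_*` names and bit values are illustrative assumptions for this example, not copied from the DAOS headers, and C11 `_Static_assert` stands in for D_CASSERT.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state bits; values are assumptions, not the DAOS ones. */
enum ex_state {
	EX_ST_UP    = 1 << 0,
	EX_ST_UPIN  = 1 << 1,
	EX_ST_DOWN  = 1 << 2,
	EX_ST_DRAIN = 1 << 3,
};

/* States of ranks that make up the (example) pool group. */
#define EX_GROUP_STATES (EX_ST_UP | EX_ST_UPIN | EX_ST_DRAIN)

/* States of ranks that may host an (example) service replica. */
#define EX_SVC_STATES   (EX_ST_UP | EX_ST_UPIN)

/* a is a subset of b iff masking a with b leaves a unchanged. */
#define EX_IS_SUBSET(a, b) (((a) & (b)) == (a))

/* Compile-time check, analogous in spirit to the D_CASSERT in the diff. */
_Static_assert(EX_IS_SUBSET(EX_SVC_STATES, EX_GROUP_STATES),
	       "service states must be a subset of group states");

/* True if a rank in this state is a member of the pool group. */
static bool
ex_in_group(unsigned state)
{
	return (state & EX_GROUP_STATES) != 0;
}

/* True if a rank in this state may host a service replica. */
static bool
ex_can_host_replica(unsigned state)
{
	return (state & EX_SVC_STATES) != 0;
}
```

With these example masks, a draining rank is still in the group but may not host a replica, and the compile-time check guarantees that any rank allowed to host a replica is also a group member.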