
Add auto deregistration of offline participants after timeout #2932

Open

GrantPSpencer wants to merge 2 commits into master from participant-auto-deregistration

Conversation

@GrantPSpencer (Contributor) commented on Oct 1, 2024

Issues

  • My PR addresses the following Helix issues and references them in the PR description:
    New feature allowing the controller to purge participants that have been offline for longer than a user-defined timeout.

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    Participants can automatically join a Helix cluster when they start up. However, when they permanently go down, they must be manually removed or purged by an external workflow in order to actually leave the cluster. These stale participants can have a significant negative impact on the cluster in at least two ways:
  • MAX_OFFLINE_INSTANCES_ALLOWED - If this cluster-level config is exceeded, the cluster will be put into maintenance mode.
  • CRUSHED Calculations - CRUSHED only guarantees evenness when all nodes in a cluster are online. The more offline nodes in the cluster, the larger the max degree of unevenness that is possible.
    This causes Helix's view of the cluster's health and the actual health of the cluster to diverge.

This PR introduces a ParticipantDeregistrationPipeline which contains the ParticipantDeregistrationStage. When the stage is executed, the controller looks at all offline nodes to determine whether any have exceeded the user-set PARTICIPANT_DEREGISTRATION_TIMEOUT in the ClusterConfig. If so, the controller removes the participant from the cluster. If there are offline instances that have not yet exceeded the timeout, the controller schedules the deregistration pipeline to run again when the next longest-offline node would exceed the deregistration timeout.
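For illustration, here is a minimal editorial sketch of the per-instance decision described above; it is not the PR's exact code. The map name offlineTimeMap is an assumption, while cache.getLiveInstances() and scheduleOnDemandPipeline are taken from the excerpts discussed in the review below.

    // Illustrative sketch only; offlineTimeMap (instance name -> offline timestamp in ms) is an assumed name.
    long now = System.currentTimeMillis();
    long timeoutMs = cache.getClusterConfig().getParticipantDeregistrationTimeout();
    long nextDeregisterTime = Long.MAX_VALUE;
    Set<String> participantsToDeregister = new HashSet<>();
    for (Map.Entry<String, Long> entry : offlineTimeMap.entrySet()) {
      String instanceName = entry.getKey();
      if (cache.getLiveInstances().containsKey(instanceName)) {
        continue; // instance came back online, nothing to deregister
      }
      long deregisterTime = entry.getValue() + timeoutMs;
      if (deregisterTime <= now) {
        participantsToDeregister.add(instanceName); // offline longer than the timeout
      } else {
        nextDeregisterTime = Math.min(nextDeregisterTime, deregisterTime); // earliest future deadline
      }
    }
    if (nextDeregisterTime != Long.MAX_VALUE) {
      // Re-run the deregistration pipeline when the next longest-offline node would exceed the timeout
      scheduleOnDemandPipeline(manager.getClusterName(), nextDeregisterTime - now);
    }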

[Image: Participant Deregistration diagram (white background)]

There are still areas that need to be discussed:

  1. Pros/cons of having this be a separate pipeline.
    • The ParticipantDeregistrationStage could occur at the end of the rebalancer pipeline, but rescheduling deregistrations would have to trigger an entire rebalance pipeline as well. This would then immediately trigger another rebalance pipeline if any instances are dropped, but would keep controller writes to the ZK ensemble logically grouped to one part of the pipeline.
  2. Restrictions and guardrails for setting the deregistration configs.
    • Should deregistration be mutually exclusive with delay rebalance?
    • Should we enforce our own minimum deregistration timeout that the user cannot go below?
    • Should we take the max of delay rebalance window and deregister timeout?

Code Changes:

  • Added ParticipantDeregistrationPipeline and ParticipantDeregistrationStage to handle participant deregistration logic.

    • Added PARTICIPANT_DEREGISTRATION_ENABLED and PARTICIPANT_DEREGISTRATION_TIMEOUT properties to ClusterConfig (a usage sketch follows this list).
    • Added ParticipantDeregistrationTask to trigger a participant deregistration pipeline at a specified time, similar to what is done for scheduling a rebalance.
    • Added a new ClusterEventType, ParticipantDeregistration, for the scheduled trigger of the deregistration pipeline.
  • Updated ZkTestBase with addParticipant and dropParticipant methods to be leveraged across different test classes.

    • Subsequently changed TestAddResourceWhenRequireDelayedRebalanceOverwrite.java and TestForceKillInstance.java to leverage these changes.
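As a usage illustration (an editorial sketch, not part of the diff), an operator could enable the feature through the cluster config, assuming Helix's existing ConfigAccessor flow; note the review discussion below suggests dropping the explicit enabled flag in favor of a timeout-only switch.

    // zkAddress and clusterName are placeholders for this sketch
    ConfigAccessor configAccessor = new ConfigAccessor(zkAddress);
    ClusterConfig clusterConfig = configAccessor.getClusterConfig(clusterName);
    clusterConfig.setParticipantDeregistrationEnabled(true);
    // Deregister participants that stay offline for more than 24 hours (example value)
    clusterConfig.setParticipantDeregistrationTimeout(TimeUnit.HOURS.toMillis(24));
    configAccessor.setClusterConfig(clusterName, clusterConfig);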

Tests

  • The following tests are written for this issue:
    TestParticipantDeregistrationStage.java

  • The following is the result of the "mvn test" command on the appropriate module:

$ mvn test -o -Dtest=TestParticipantDeregistrationStage -pl=helix-core

[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:19 min
[INFO] Finished at: 2024-10-01T11:19:05-07:00
[INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

N/A. The feature is optional and is off by default.

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

Comment on lines 718 to 719
_participantDeregistrationTimer =
    new Timer("GenericHelixController_" + _clusterName + "_participantDeregistration_Timer", true);
Contributor:

Since it is the same pipeline, I suggest leveraging the same timer. This operation also counts as on-demand rebalancing.

Contributor Author:

Hi @junkaixue, thanks for the review! Just want to clarify your feedback:
(1) Move the ParticipantDeregistrationStage to the end of the rebalance pipeline - currently it belongs to a new, separate pipeline, ParticipantDeregistrationPipeline.
(2) Leverage scheduleOnDemandRebalance, since participant deregistration would now occur as part of the on-demand rebalance.

Contributor:

Yes.


Timer _participantDeregistrationTimer = null;
AtomicReference<ParticipantDeregistrationTask> _nextParticipantDeregistrationTask = new AtomicReference<>();
class ParticipantDeregistrationTask extends TimerTask {
Contributor:

We can still leverage the original task with a specified time. The compute logic does not have to live in the task; it can be part of the pipeline stage. Also, you can use the "schedule pipeline" util; it is in the code base somewhere.

Contributor Author:

Left a question in your above comment. Agree that we do not need a new task + timer if we move participant deregistration to be part of the rebalancer.

My view is that increasing the frequency of on-demand rebalances will add more work to the cluster than having a separate pipeline for participant deregistration. There may also be many on-demand rebalances called just to deregister a participant, each of which will then trigger its own rebalance.

Are the main benefits of moving the deregistration stage to the rebalance pipeline reducing code and keeping ZK writes grouped at the end of the rebalancer? Thank you for the advice.

Contributor:

Unifying the ZK write stage and minimizing writes is the major consideration.

But more important is reducing the heaviness of the code. Creating a "Task" class for this simple pipeline trigger is too much.

@GrantPSpencer force-pushed the participant-auto-deregistration branch from 7d1c222 to ea94999 on October 4, 2024 at 21:25
Comment on lines 1260 to 1267

public boolean isParticipantDeregistrationEnabled() {
  return _record.getBooleanField(ClusterConfigProperty.PARTICIPANT_DEREGISTRATION_ENABLED.name(), false);
}

public void setParticipantDeregistrationEnabled(boolean enabled) {
  _record.setBooleanField(ClusterConfigProperty.PARTICIPANT_DEREGISTRATION_ENABLED.name(), enabled);
}

public long getParticipantDeregistrationTimeout() {
  return _record.getLongField(ClusterConfigProperty.PARTICIPANT_DEREGISTRATION_TIMEOUT.name(), -1);
}

public void setParticipantDeregistrationTimeout(long timeout) {
  _record.setLongField(ClusterConfigProperty.PARTICIPANT_DEREGISTRATION_TIMEOUT.name(), timeout);
}
Contributor:

Let's simplify this? If the timeout is not set or is -1, the feature is not enabled. Other cluster configs are resource level, so some resources may not have the configuration, but this one is instance level. I don't think we need a switch to control the feature.

Contributor Author:

Adjusted to only rely on the timeout value. If the timeout is -1 or lower, the feature is turned off. If no timeout is set, the default value is -1, which is interpreted as disabled.
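For reference, a minimal sketch of what a timeout-only check could look like (illustrative; not the exact committed code):

    public long getParticipantDeregistrationTimeout() {
      // Defaults to -1, which is interpreted as "feature disabled"
      return _record.getLongField(ClusterConfigProperty.PARTICIPANT_DEREGISTRATION_TIMEOUT.name(), -1);
    }

    public boolean isParticipantDeregistrationEnabled() {
      // Any value of -1 or lower means the feature is off
      return getParticipantDeregistrationTimeout() > -1;
    }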

Long offlineTime = entry.getValue();
long deregisterTime = offlineTime + deregisterDelay;

// Skip if instance is still online
Contributor:

You should rely on LiveInstances instead of history.

Contributor Author:

Adjusted this check to be:

if (cache.getLiveInstances().containsKey(instanceName)) {

Comment on lines 61 to 63
if (deregisterTime < earliestDeregisterTime) {
  earliestDeregisterTime = deregisterTime;
}
Contributor:

How can this happen? The else clause condition is deregisterTime > stageStartTime. It will never be triggered.

Contributor Author:

Adjusted this to remove the if and clarified the variable name.

The if branch (deregisterTime <= stageStartTime) should catch all nodes that need to be deregistered during this run.

The else branch will include all nodes that need to be deregistered at some point in the future. We then look for the node that is next in line to be deregistered and use that time to schedule the next deregistration.

The logic is now:

      // If deregister time is in the past, deregister the instance
      if (deregisterTime <= stageStartTime) {
        participantsToDeregister.add(instanceName);
      } else {
        // Otherwise, find the next earliest deregister time
        nextDeregisterTime = Math.min(nextDeregisterTime, deregisterTime);
      }

Comment on lines 76 to 77
if (earliestDeregisterTime != Long.MAX_VALUE) {
  long delay = earliestDeregisterTime - stageStartTime;
  scheduleOnDemandPipeline(manager.getClusterName(), delay);
}
Contributor:

This is not accurate. If this stage is very slow, the time for the next deregistration may have already passed. Use the next closest deregister instance time.

Contributor Author:

Good catch, thank you for calling this out.
Updated the calculation of the delay:

long delay = nextDeregisterTime - System.currentTimeMillis();

Do you think there is an argument that the check of whether a participant should be deregistered,

if (deregisterTime <= stageStartTime) {

should also use currentTime rather than stageStartTime?
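For context, a sketch of the rescheduling path under discussion; the clamp to zero is a hypothetical guard (an assumption, not code from this PR) for the case where the stage itself overruns the next deadline:

    if (nextDeregisterTime != Long.MAX_VALUE) {
      // Schedule relative to the current time rather than the stage start time,
      // and never with a negative delay if the stage ran long
      long delay = Math.max(0, nextDeregisterTime - System.currentTimeMillis());
      scheduleOnDemandPipeline(manager.getClusterName(), delay);
    }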

Contributor Author (@GrantPSpencer):

Hi @junkaixue, sorry for the huge delay in responding to feedback. I had two on-call rotations, then vacation, and then got sick. Please take a look at this when you can. Ideally, we would like this to be incorporated into the upcoming release if possible.
