
Add auto deregistration of offline participants after timeout #2932

Open

GrantPSpencer wants to merge 2 commits into master from participant-auto-deregistration

Conversation

@GrantPSpencer (Contributor) commented on Oct 1, 2024

Issues

  • My PR addresses the following Helix issues and references them in the PR description:
    New feature allowing the controller to purge participants that have been offline for longer than a user-defined timeout.

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    Participants can automatically join a Helix cluster when they start up. However, when they permanently go down, they must be manually removed or purged by an external workflow in order to actually leave the cluster. These stale participants can have a significant negative impact on the cluster in at least two ways:
  • MAX_OFFLINE_INSTANCES_ALLOWED - If this cluster-level config is exceeded, the cluster will be put into maintenance mode.
  • CRUSHED Calculations - CRUSHED only guarantees evenness when all nodes in a cluster are online. The more offline nodes in the cluster, the larger the max degree of unevenness that is possible.
    This causes Helix's view of the cluster's health and the actual health of the cluster to diverge.

This PR introduces a ParticipantDeregistrationPipeline which contains the ParticipantDeregistrationStage. When the stage is executed, the controller looks at all offline nodes to determine whether any have exceeded the user-set PARTICIPANT_DEREGISTRATION_TIMEOUT in the ClusterConfig. If so, the controller removes the participant from the cluster. If there are offline instances that have not yet exceeded the timeout, the controller schedules the deregistration pipeline to run again when the next longest-offline node would exceed the deregistration timeout.
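For illustration, here is a minimal editorial sketch of the per-instance decision described above; it is not the PR's exact code. The map name offlineTimeMap is an assumption, while cache.getLiveInstances() and scheduleOnDemandPipeline are taken from the excerpts discussed in the review below.

    // Illustrative sketch only; offlineTimeMap (instance name -> offline timestamp in ms) is an assumed name.
    long now = System.currentTimeMillis();
    long timeoutMs = cache.getClusterConfig().getParticipantDeregistrationTimeout();
    long nextDeregisterTime = Long.MAX_VALUE;
    Set<String> participantsToDeregister = new HashSet<>();
    for (Map.Entry<String, Long> entry : offlineTimeMap.entrySet()) {
      String instanceName = entry.getKey();
      if (cache.getLiveInstances().containsKey(instanceName)) {
        continue; // instance came back online, nothing to deregister
      }
      long deregisterTime = entry.getValue() + timeoutMs;
      if (deregisterTime <= now) {
        participantsToDeregister.add(instanceName); // offline longer than the timeout
      } else {
        nextDeregisterTime = Math.min(nextDeregisterTime, deregisterTime); // earliest future deadline
      }
    }
    if (nextDeregisterTime != Long.MAX_VALUE) {
      // Re-run the deregistration pipeline when the next longest-offline node would exceed the timeout
      scheduleOnDemandPipeline(manager.getClusterName(), nextDeregisterTime - now);
    }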

[Image: Participant Deregistration diagram (white background)]

There are still areas that need to be discussed:

  1. Pros/cons of having this be a separate pipeline.
    • The ParticipantDeregistrationStage could occur at the end of the rebalancer pipeline, but rescheduling deregistrations would have to trigger an entire rebalance pipeline as well. This would then immediately trigger another rebalance pipeline if any instances are dropped, but would keep controller writes to the ZK ensemble logically grouped to one part of the pipeline.
  2. Restrictions and guardrails for setting the deregistration configs.
    • Should deregistration be mutually exclusive with delay rebalance?
    • Should we enforce our own minimum deregistration timeout that the user cannot go below?
    • Should we take the max of delay rebalance window and deregister timeout?

Code Changes:

  • Added ParticipantDeregistrationPipeline and ParticipantDeregistrationStage to handle participant deregistration logic.

    • Added PARTICIPANT_DEREGISTRATION_ENABLED and PARTICIPANT_DEREGISTRATION_TIMEOUT properties to ClusterConfig (a usage sketch follows this list).
    • Added ParticipantDeregistrationTask to trigger a participant deregistration pipeline at a specified time, similar to what is done for scheduling a rebalance.
    • Added a new ClusterEventType, ParticipantDeregistration, for the scheduled trigger of the deregistration pipeline.
  • Updated ZkTestBase with addParticipant and dropParticipant methods to be leveraged across different test classes.

    • Subsequently changed TestAddResourceWhenRequireDelayedRebalanceOverwrite.java and TestForceKillInstance.java to leverage these changes.
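As a usage illustration (an editorial sketch, not part of the diff), an operator could enable the feature through the cluster config, assuming Helix's existing ConfigAccessor flow; note the review discussion below suggests dropping the explicit enabled flag in favor of a timeout-only switch.

    // zkAddress and clusterName are placeholders for this sketch
    ConfigAccessor configAccessor = new ConfigAccessor(zkAddress);
    ClusterConfig clusterConfig = configAccessor.getClusterConfig(clusterName);
    clusterConfig.setParticipantDeregistrationEnabled(true);
    // Deregister participants that stay offline for more than 24 hours (example value)
    clusterConfig.setParticipantDeregistrationTimeout(TimeUnit.HOURS.toMillis(24));
    configAccessor.setClusterConfig(clusterName, clusterConfig);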

Tests

  • The following tests are written for this issue:
    TestParticipantDeregistrationStage.java

  • The following is the result of the "mvn test" command on the appropriate module:

$ mvn test -o -Dtest=TestParticipantDeregistrationStage -pl=helix-core

[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:19 min
[INFO] Finished at: 2024-10-01T11:19:05-07:00
[INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

N/A. The feature is optional and is off by default.

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

Comment on lines 718 to 719
_participantDeregistrationTimer =
    new Timer("GenericHelixController_" + _clusterName + "_participantDeregistration_Timer", true);
Contributor:

Since it is the same pipeline, I suggest leveraging the same timer. This operation also counts as on-demand rebalancing.

Contributor Author:

Hi @junkaixue, thanks for the review! Just want to clarify your feedback:
(1) Move the ParticipantDeregistrationStage to the end of the rebalance pipeline - currently it belongs to a new, separate pipeline, ParticipantDeregistrationPipeline.
(2) Leverage scheduleOnDemandRebalance, since participant deregistration would now occur as part of the on-demand rebalance.

Contributor:

Yes.


Timer _participantDeregistrationTimer = null;
AtomicReference<ParticipantDeregistrationTask> _nextParticipantDeregistrationTask = new AtomicReference<>();
class ParticipantDeregistrationTask extends TimerTask {
Contributor:

We can still leverage the original task with a specified time. The compute logic does not have to live in the task; it can be part of the pipeline stage. Also, you can use the "schedule pipeline" util; it is in the code base somewhere.

Contributor Author:

Left a question in your above comment. Agree that we do not need a new task + timer if we move participant deregistration to be part of the rebalancer.

My view is that increasing the frequency of on-demand rebalances will add more work to the cluster than having a separate pipeline for participant deregistration. There may also be many on-demand rebalances called just to deregister a participant, each of which will then trigger its own rebalance.

Are the main benefits of moving the deregistration stage to the rebalance pipeline reducing code and keeping ZK writes grouped at the end of the rebalancer? Thank you for the advice.

Contributor:

Unifying the ZK write stage and minimizing writes is the major consideration.

But more important is reducing the heaviness of the code. Creating a "Task" class for this simple pipeline trigger is too much.

@GrantPSpencer force-pushed the participant-auto-deregistration branch from 7d1c222 to ea94999 on October 4, 2024 at 21:25
Comment on lines 1260 to 1267

public boolean isParticipantDeregistrationEnabled() {
  return _record.getBooleanField(ClusterConfigProperty.PARTICIPANT_DEREGISTRATION_ENABLED.name(), false);
}

public void setParticipantDeregistrationEnabled(boolean enabled) {
  _record.setBooleanField(ClusterConfigProperty.PARTICIPANT_DEREGISTRATION_ENABLED.name(), enabled);
}

public long getParticipantDeregistrationTimeout() {
  return _record.getLongField(ClusterConfigProperty.PARTICIPANT_DEREGISTRATION_TIMEOUT.name(), -1);
}

public void setParticipantDeregistrationTimeout(long timeout) {
  _record.setLongField(ClusterConfigProperty.PARTICIPANT_DEREGISTRATION_TIMEOUT.name(), timeout);
}
Contributor:

Let's simplify this? If the timeout is not set or is -1, the feature is not enabled. Other cluster configs are resource level, so some resources may not have the configuration, but this one is instance level. I don't think we need a switch to control the feature.

Contributor Author:

Adjusted to only rely on the timeout value. If the timeout is -1 or lower, the feature is turned off. If no timeout is set, the default value is -1, which is interpreted as disabled.
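For reference, a minimal sketch of what a timeout-only check could look like (illustrative; not the exact committed code):

    public long getParticipantDeregistrationTimeout() {
      // Defaults to -1, which is interpreted as "feature disabled"
      return _record.getLongField(ClusterConfigProperty.PARTICIPANT_DEREGISTRATION_TIMEOUT.name(), -1);
    }

    public boolean isParticipantDeregistrationEnabled() {
      // Any value of -1 or lower means the feature is off
      return getParticipantDeregistrationTimeout() > -1;
    }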

Long offlineTime = entry.getValue();
long deregisterTime = offlineTime + deregisterDelay;

// Skip if instance is still online
Contributor:

You should rely on LiveInstances instead of history.

Contributor Author:

Adjusted this check to be:

if (cache.getLiveInstances().containsKey(instanceName)) {

Comment on lines 61 to 63
if (deregisterTime < earliestDeregisterTime) {
  earliestDeregisterTime = deregisterTime;
}
Contributor:

How can this happen? The else clause condition is deregisterTime > stageStartTime. It will never be triggered.

Contributor Author:

Adjusted this to remove the if and clarified the variable name.

The if branch (deregisterTime <= stageStartTime) should catch all nodes that need to be deregistered during this run.

The else branch will include all nodes that need to be deregistered at some point in the future. We then look for the node that is next in line to be deregistered and use that time to schedule the next deregistration.

The logic is now:

      // If deregister time is in the past, deregister the instance
      if (deregisterTime <= stageStartTime) {
        participantsToDeregister.add(instanceName);
      } else {
        // Otherwise, find the next earliest deregister time
        nextDeregisterTime = Math.min(nextDeregisterTime, deregisterTime);
      }

Comment on lines 76 to 77
if (earliestDeregisterTime != Long.MAX_VALUE) {
  long delay = earliestDeregisterTime - stageStartTime;
  scheduleOnDemandPipeline(manager.getClusterName(), delay);
}
Contributor:

This is not accurate. If this stage is very slow, the time for the next deregistration may have already passed. Use the next closest deregister instance time.

Contributor Author:

Good catch, thank you for calling this out.
Updated the calculation of the delay:

long delay = nextDeregisterTime - System.currentTimeMillis();

Do you think there is an argument that the check of whether a participant should be deregistered,

if (deregisterTime <= stageStartTime) {

should also use currentTime rather than stageStartTime?
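For context, a sketch of the rescheduling path under discussion; the clamp to zero is a hypothetical guard (an assumption, not code from this PR) for the case where the stage itself overruns the next deadline:

    if (nextDeregisterTime != Long.MAX_VALUE) {
      // Schedule relative to the current time rather than the stage start time,
      // and never with a negative delay if the stage ran long
      long delay = Math.max(0, nextDeregisterTime - System.currentTimeMillis());
      scheduleOnDemandPipeline(manager.getClusterName(), delay);
    }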

Contributor Author (@GrantPSpencer):

Hi @junkaixue, sorry for the huge delay in responding to feedback. I had two on-call rotations, then vacation, and then got sick. Please take a look at this when you can. Ideally, we would like this to be incorporated into the upcoming release if possible.
