
Configurable TIMEOUTs for Worker via ENV VAR #319

Closed
wants to merge 4 commits

Conversation

@hanslemm commented Apr 8, 2024

What

This PR introduces configurable timeouts for Worker constants via ENV VARs.
For some K8s setups where nodes are spun up on demand, the current 2-minute tolerance for pod creation might not be enough.

How

With that in mind, this PR introduces the following ENV VARs, which map to their corresponding constants:

  • INIT_CONTAINER_STARTUP_TIMEOUT -> accepts only >0 values.
  • INIT_CONTAINER_TERMINATION_TIMEOUT -> accepts only >0 values.
  • POD_READY_TIMEOUT -> accepts only >0 values.

When the ENV VARs are not set, the current default values are used.

Furthermore, when the user provides an invalid value for one of these constants, the parseEnvVar function handles it by printing a helpful message saying that the format is invalid and that the default value will be used.
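
A minimal sketch of how such a helper could look, assuming the values are interpreted as minutes and that an SLF4J-style LOGGER is in scope (only the name parseEnvVar comes from this PR description; everything else below is illustrative, not the actual implementation):

import java.time.Duration;

// Sketch only: the units (minutes) and the exact fallback behaviour are assumptions.
private static Duration parseEnvVar(final String name, final Duration defaultValue) {
  final String raw = System.getenv(name);
  if (raw == null || raw.isBlank()) {
    // ENV VAR not set: keep the current default.
    return defaultValue;
  }
  try {
    final long minutes = Long.parseLong(raw.trim());
    if (minutes <= 0) {
      throw new NumberFormatException("value must be > 0");
    }
    return Duration.ofMinutes(minutes);
  } catch (final NumberFormatException e) {
    // Invalid format: print a helpful message and fall back to the default.
    LOGGER.warn("Invalid value '{}' for {}, falling back to the default of {} minutes.",
        raw, name, defaultValue.toMinutes());
    return defaultValue;
  }
}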

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

@CLAassistant commented Apr 8, 2024

CLA assistant check
All committers have signed the CLA.

@bgroff (Contributor) commented Apr 25, 2024

The default timeout has been adjusted to 15 minutes per attempt. We will try 5 attempts, so this is roughly an hour and a half. With this information, I am going to close this PR as won't fix.

@bgroff closed this Apr 25, 2024
@david-freistrom commented Aug 6, 2024

@bgroff

The default timeout has been adjusted to 15 minutes per attempt. We will try 5 attempts, so this is roughly an hour and a half. With this information, I am going to close this PR as won't fix.

Where did you define 15 minutes? The code referenced in the PR shows 2 minutes, and there are no retry attempts for pod creation at all.

public static final Duration POD_READY_TIMEOUT = Duration.ofMinutes(2);

2 minutes is not enough for serverless architectures and autoscaled nodes if compute resources have to be provisioned first.

The PR did not change your defaults. It just makes your hard-coded values configurable. What is wrong with that?

The code lines that cause the issue we face also explain our use case:

LOGGER.info("Waiting until pod is ready...");
// If a pod gets into a non-terminal error state it should be automatically killed by our
// heartbeating mechanism.
// This also handles the case where a very short pod already completes before this check
// completes the first time.
// This doesn't manage things like pods that are blocked from running for some cluster reason
// or if the init container got stuck somehow.
fabricClient.resource(podDefinition).waitUntilCondition(p -> {
  final boolean isReady = Objects.nonNull(p) && Readiness.getInstance().isReady(p);
  final boolean isTerminal = Objects.nonNull(p) && KubePodResourceHelper.isTerminal(p);
  return isReady || isTerminal;
}, POD_READY_TIMEOUT.toMinutes(), TimeUnit.MINUTES);
MetricClientFactory.getMetricClient().distribution(OssMetricsRegistry.KUBE_POD_PROCESS_CREATE_TIME_MILLISECS,
    System.currentTimeMillis() - start);

Our pod gets stuck in the Pending state, waiting for a node to become ready so it can be scheduled.
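
For illustration only, a change along the lines of this PR would initialize the constant from the ENV VAR while the call site above stays untouched (reusing the hypothetical parseEnvVar sketch from earlier; the real patch may differ):

// Sketch: default remains 2 minutes unless POD_READY_TIMEOUT is set in the environment.
public static final Duration POD_READY_TIMEOUT =
    parseEnvVar("POD_READY_TIMEOUT", Duration.ofMinutes(2));

An operator running on autoscaled nodes could then raise the timeout (again assuming minutes, e.g. POD_READY_TIMEOUT=10) to leave room for node provisioning before waitUntilCondition gives up.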
