
Configurable TIMEOUTs for Worker via ENV VAR #319

Closed
wants to merge 4 commits

Conversation

@hanslemm commented Apr 8, 2024

What

This PR introduces configurable timeouts for Worker constants via ENV VARs.
For some K8s setups where nodes are spun up on demand, the current 2-minute tolerance for pod creation might not be enough.

How

With that in mind, this PR introduces the following ENV VARs, which map to their corresponding constants:

  • INIT_CONTAINER_STARTUP_TIMEOUT -> accepts only >0 values.
  • INIT_CONTAINER_TERMINATION_TIMEOUT -> accepts only >0 values.
  • POD_READY_TIMEOUT -> accepts only >0 values.

When the ENV VARs are not set, the current default values are used.

Furthermore, when the user provides an invalid value for one of these constants, the parseEnvVar function handles it by printing a helpful message saying that the format is invalid and that the default value will be used.
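
A minimal sketch of how such a helper could look, assuming the values are interpreted as minutes and that an SLF4J-style LOGGER is in scope (only the name parseEnvVar comes from this PR description; everything else below is illustrative, not the actual implementation):

import java.time.Duration;

// Sketch only: the units (minutes) and the exact fallback behaviour are assumptions.
private static Duration parseEnvVar(final String name, final Duration defaultValue) {
  final String raw = System.getenv(name);
  if (raw == null || raw.isBlank()) {
    // ENV VAR not set: keep the current default.
    return defaultValue;
  }
  try {
    final long minutes = Long.parseLong(raw.trim());
    if (minutes <= 0) {
      throw new NumberFormatException("value must be > 0");
    }
    return Duration.ofMinutes(minutes);
  } catch (final NumberFormatException e) {
    // Invalid format: print a helpful message and fall back to the default.
    LOGGER.warn("Invalid value '{}' for {}, falling back to the default of {} minutes.",
        raw, name, defaultValue.toMinutes());
    return defaultValue;
  }
}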

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

@CLAassistant commented Apr 8, 2024

CLA assistant check
All committers have signed the CLA.

@bgroff (Contributor) commented Apr 25, 2024

The default timeout has been adjusted to 15 minutes per attempt. We will try 5 attempts, so this is roughly an hour and a half. With this information, I am going to close this PR as won't fix.

@bgroff closed this Apr 25, 2024
@david-freistrom commented Aug 6, 2024

@bgroff

The default timeout has been adjusted to 15 minutes per attempt. We will try 5 attempts, so this is roughly an hour and a half. With this information, I am going to close this PR as won't fix.

Where did you define 15 minutes? The code referenced in the PR shows 2 minutes, and there are no retry attempts for pod creation at all.

public static final Duration POD_READY_TIMEOUT = Duration.ofMinutes(2);

2 minutes is not enough for serverless architectures and autoscaled nodes if compute resources have to be provisioned first.

The PR did not change your defaults. It just makes your hard-coded values configurable. What is wrong with that?

The code lines that cause the issue we face also explain our use case:

LOGGER.info("Waiting until pod is ready...");
// If a pod gets into a non-terminal error state it should be automatically killed by our
// heartbeating mechanism.
// This also handles the case where a very short pod already completes before this check
// completes the first time.
// This doesn't manage things like pods that are blocked from running for some cluster reason
// or if the init container got stuck somehow.
fabricClient.resource(podDefinition).waitUntilCondition(p -> {
  final boolean isReady = Objects.nonNull(p) && Readiness.getInstance().isReady(p);
  final boolean isTerminal = Objects.nonNull(p) && KubePodResourceHelper.isTerminal(p);
  return isReady || isTerminal;
}, POD_READY_TIMEOUT.toMinutes(), TimeUnit.MINUTES);
MetricClientFactory.getMetricClient().distribution(OssMetricsRegistry.KUBE_POD_PROCESS_CREATE_TIME_MILLISECS,
    System.currentTimeMillis() - start);

Our pod gets stuck in the Pending state, waiting for a node to become ready so it can be scheduled.
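
For illustration only, a change along the lines of this PR would initialize the constant from the ENV VAR while the call site above stays untouched (reusing the hypothetical parseEnvVar sketch from earlier; the real patch may differ):

// Sketch: default remains 2 minutes unless POD_READY_TIMEOUT is set in the environment.
public static final Duration POD_READY_TIMEOUT =
    parseEnvVar("POD_READY_TIMEOUT", Duration.ofMinutes(2));

An operator running on autoscaled nodes could then raise the timeout (again assuming minutes, e.g. POD_READY_TIMEOUT=10) to leave room for node provisioning before waitUntilCondition gives up.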
