feat: add retry logic for k8s client argoproj#7692 (argoproj#16154)
* add retry logic for k8s client

Signed-off-by: Pavel Aborilov <[email protected]>

* add docs for retry logic and envs to manifests

Signed-off-by: Pavel Aborilov <[email protected]>

---------

Signed-off-by: Pavel Aborilov <[email protected]>
Signed-off-by: Pavel <[email protected]>
aborilov authored and alexmt committed Jan 19, 2024
1 parent 1b025d2 commit 3bca48a
Showing 11 changed files with 333 additions and 6 deletions.
10 changes: 10 additions & 0 deletions docs/operator-manual/argocd-cmd-params-cm.yaml
@@ -58,6 +58,12 @@ data:
controller.sharding.algorithm: legacy
# Number of allowed concurrent kubectl fork/execs. Any value less than 1 means no limit.
controller.kubectl.parallelism.limit: "20"
# The maximum number of retries for each request
controller.k8sclient.retry.max: "0"
# The initial backoff delay on the first retry attempt in ms. Subsequent retries will double this backoff time up to a maximum threshold
controller.k8sclient.retry.base.backoff: "100"
# Grace period in seconds for ignoring consecutive errors while communicating with repo server.
controller.repo.error.grace.period.seconds: "180"

## Server properties
# Listen on given address for incoming connections (default "0.0.0.0")
@@ -72,6 +78,10 @@ data:
server.rootpath: ""
# Directory path that contains additional static assets
server.staticassets: "/shared/app"
# The maximum number of retries for each request
server.k8sclient.retry.max: "0"
# The initial backoff delay on the first retry attempt in ms. Subsequent retries will double this backoff time up to a maximum threshold
server.k8sclient.retry.base.backoff: "100"

# Set the logging format. One of: text|json (default "text")
server.log.format: "text"
88 changes: 88 additions & 0 deletions docs/operator-manual/high_availability.md
@@ -229,3 +229,91 @@ spec:
path: my-application
# ...
```

## Rate Limiting Application Reconciliations

To prevent high controller resource usage or sync loops caused by misbehaving apps or other environment-specific factors,
we can configure rate limits on the workqueues used by the application controller. Two types of rate limits can be configured:

* Global rate limits
* Per item rate limits

The final rate limiter uses a combination of both and calculates the final backoff as `max(globalBackoff, perItemBackoff)`.

### Global rate limits

This is enabled by default. It is a simple bucket-based rate limiter that limits the number of items that can be queued per second.
This is useful to prevent a large number of apps from being queued at the same time.

To configure the bucket limiter you can set the following environment variables:

* `WORKQUEUE_BUCKET_SIZE` - The number of items that can be queued in a single burst. Defaults to 500.
* `WORKQUEUE_BUCKET_QPS` - The number of items that can be queued per second. Defaults to 50.

### Per item rate limits

By default this returns a fixed base delay/backoff value, but it can be configured to return exponentially increasing values, as described below.
The per-item rate limiter limits the number of times a particular item can be queued. It is based on exponential backoff: the backoff time for an item keeps increasing exponentially
if it is queued multiple times in a short period, but the backoff is reset automatically once a configured `cool down` period has elapsed since the item was last queued.

To configure the per item limiter you can set the following environment variables:

* `WORKQUEUE_FAILURE_COOLDOWN_NS` : The cool down period in nanoseconds. Once this period has elapsed for an item, its backoff is reset. Exponential backoff is disabled if set to 0 (the default). Example value: 10 * 10^9 (=10s)
* `WORKQUEUE_BASE_DELAY_NS` : The base delay in nanoseconds. This is the initial backoff used in the exponential backoff formula. Defaults to 1000 (=1μs)
* `WORKQUEUE_MAX_DELAY_NS` : The max delay in nanoseconds. This is the upper limit on the backoff. Defaults to 3 * 10^9 (=3s)
* `WORKQUEUE_BACKOFF_FACTOR` : The backoff factor. This is the factor by which the backoff is multiplied for each retry. Defaults to 1.5

The backoff time for an item is calculated as follows, where `numRequeue` is the number of times the item has been queued
and `lastRequeueTime` is the time at which the item was last queued:

- When `WORKQUEUE_FAILURE_COOLDOWN_NS` != 0 :

```
backoff = time.Since(lastRequeueTime) >= WORKQUEUE_FAILURE_COOLDOWN_NS ?
          WORKQUEUE_BASE_DELAY_NS :
          min(
              WORKQUEUE_MAX_DELAY_NS,
              WORKQUEUE_BASE_DELAY_NS * WORKQUEUE_BACKOFF_FACTOR ^ (numRequeue)
          )
```

- When `WORKQUEUE_FAILURE_COOLDOWN_NS` = 0 :

```
backoff = WORKQUEUE_BASE_DELAY_NS
```
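
The two cases above can be folded into one function. The following is a minimal Go sketch of the formula only, not the controller's actual rate limiter; the `itemBackoffConfig` type, its field names, and `perItemBackoff` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// itemBackoffConfig mirrors the four WORKQUEUE_* settings described above.
// The type and its field names are hypothetical, chosen for this sketch.
type itemBackoffConfig struct {
	cooldown  time.Duration // WORKQUEUE_FAILURE_COOLDOWN_NS
	baseDelay time.Duration // WORKQUEUE_BASE_DELAY_NS
	maxDelay  time.Duration // WORKQUEUE_MAX_DELAY_NS
	factor    float64       // WORKQUEUE_BACKOFF_FACTOR
}

// perItemBackoff evaluates the documented formula for a single item.
// sinceLastRequeue stands in for time.Since(lastRequeueTime).
func perItemBackoff(cfg itemBackoffConfig, numRequeue int, sinceLastRequeue time.Duration) time.Duration {
	// A cooldown of 0 disables exponential backoff; an elapsed cooldown resets it.
	if cfg.cooldown == 0 || sinceLastRequeue >= cfg.cooldown {
		return cfg.baseDelay
	}
	d := float64(cfg.baseDelay) * math.Pow(cfg.factor, float64(numRequeue))
	if d > float64(cfg.maxDelay) {
		return cfg.maxDelay
	}
	return time.Duration(d)
}

func main() {
	cfg := itemBackoffConfig{
		cooldown:  10 * time.Second,
		baseDelay: time.Millisecond,
		maxDelay:  3 * time.Second,
		factor:    1.5,
	}
	// Requeue the same item repeatedly, one second apart (inside the cooldown).
	for n := 0; n <= 4; n++ {
		fmt.Println(n, perItemBackoff(cfg, n, time.Second))
	}
}
```

Comparing against `maxDelay` before converting back to `time.Duration` keeps very large exponents from overflowing the integer duration.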

## HTTP Request Retry Strategy

In scenarios where network instability or transient server errors occur, the retry strategy ensures the robustness of HTTP communication by automatically resending failed requests. It uses a combination of maximum retries and backoff intervals to prevent overwhelming the server or thrashing the network.

### Configuring Retries

The retry logic can be fine-tuned with the following environment variables:

* `ARGOCD_K8SCLIENT_RETRY_MAX` - The maximum number of retries for each request. The request will be dropped after this count is reached. Defaults to 0 (no retries).
* `ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF` - The initial backoff delay on the first retry attempt in ms. Subsequent retries will double this backoff time up to a maximum threshold. Defaults to 100ms.

### Backoff Strategy

The backoff strategy employed is a simple exponential backoff without jitter. The backoff time increases exponentially with each retry attempt until a maximum backoff duration is reached.

The formula for calculating the backoff time is:

```
backoff = min(retryWaitMax, baseRetryBackoff * (2 ^ retryAttempt))
```
Where `retryAttempt` starts at 0 and increments by 1 for each subsequent retry.
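
As a worked illustration of the formula, here is a small Go sketch. It assumes the cap is the documented 10 second default; the function name `retryBackoff` is hypothetical and is not taken from the Argo CD source.

```go
package main

import (
	"fmt"
	"time"
)

// retryWaitMax is the cap on the backoff, using the documented 10 second default.
const retryWaitMax = 10 * time.Second

// retryBackoff returns min(retryWaitMax, base * 2^attempt), where attempt
// starts at 0. Doubling step by step and stopping at the cap avoids
// integer overflow for large attempt counts.
func retryBackoff(base time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt && d < retryWaitMax; i++ {
		d *= 2
	}
	if d > retryWaitMax {
		return retryWaitMax
	}
	return d
}

func main() {
	base := 100 * time.Millisecond // default ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
	for attempt := 0; attempt <= 8; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, retryBackoff(base, attempt))
	}
}
```

With the default 100 ms base, the waits are 100 ms, 200 ms, 400 ms, 800 ms, 1.6 s, 3.2 s and 6.4 s for attempts 0 through 6; from attempt 7 onward the computed 12.8 s exceeds the cap, so the wait stays at 10 s.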

### Maximum Wait Time

There is a cap on the backoff time to prevent excessive wait times between retries. This cap is defined by:

`retryWaitMax` - The maximum duration to wait before retrying. This ensures that retries happen within a reasonable timeframe. Defaults to 10 seconds.

### Non-Retriable Conditions

Not all HTTP responses are eligible for retries. The following conditions will not trigger a retry:

* Responses with a status code indicating client errors (4xx) except for 429 Too Many Requests.
* Responses with the status code 501 Not Implemented.
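
The exclusions above can be expressed as a small predicate. This Go sketch is illustrative only: the name `isRetriableStatus` is hypothetical, and it assumes that 429 and the remaining 5xx codes are the responses that do get retried, which the list above implies but does not state outright.

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetriableStatus reports whether a response with the given status code
// would be retried under the rules listed above. Treating 2xx and 3xx as
// "not retried" is an assumption of this sketch.
func isRetriableStatus(code int) bool {
	switch {
	case code == http.StatusTooManyRequests:
		return true // the one client error that is retried
	case code >= 400 && code < 500:
		return false // other client errors are not retried
	case code == http.StatusNotImplemented:
		return false // 501 is explicitly excluded
	default:
		return code >= 500 // remaining server errors
	}
}

func main() {
	for _, code := range []int{200, 404, 429, 500, 501, 503} {
		fmt.Println(code, isRetriableStatus(code))
	}
}
```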
@@ -155,6 +155,18 @@ spec:
name: argocd-cmd-params-cm
key: controller.kubectl.parallelism.limit
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
valueFrom:
configMapKeyRef:
name: argocd-cmd-params-cm
key: controller.k8sclient.retry.max
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
valueFrom:
configMapKeyRef:
name: argocd-cmd-params-cm
key: controller.k8sclient.retry.base.backoff
optional: true
image: quay.io/argoproj/argocd:latest
imagePullPolicy: Always
name: argocd-application-controller
12 changes: 12 additions & 0 deletions manifests/base/server/argocd-server-deployment.yaml
@@ -227,6 +227,18 @@ spec:
name: argocd-cmd-params-cm
key: server.enable.proxy.extension
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
valueFrom:
configMapKeyRef:
name: argocd-cmd-params-cm
key: server.k8sclient.retry.max
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
valueFrom:
configMapKeyRef:
name: argocd-cmd-params-cm
key: server.k8sclient.retry.base.backoff
optional: true
volumeMounts:
- name: ssh-known-hosts
mountPath: /app/config/ssh
12 changes: 12 additions & 0 deletions manifests/core-install.yaml
@@ -19451,6 +19451,18 @@ spec:
key: controller.kubectl.parallelism.limit
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
valueFrom:
configMapKeyRef:
key: controller.k8sclient.retry.max
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
valueFrom:
configMapKeyRef:
key: controller.k8sclient.retry.base.backoff
name: argocd-cmd-params-cm
optional: true
image: quay.io/argoproj/argocd:v2.8.8
imagePullPolicy: Always
name: argocd-application-controller
28 changes: 28 additions & 0 deletions manifests/ha/install.yaml
@@ -20995,6 +20995,18 @@ spec:
key: server.enable.proxy.extension
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
valueFrom:
configMapKeyRef:
key: server.k8sclient.retry.max
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
valueFrom:
configMapKeyRef:
key: server.k8sclient.retry.base.backoff
name: argocd-cmd-params-cm
optional: true
image: quay.io/argoproj/argocd:v2.8.8
imagePullPolicy: Always
livenessProbe:
@@ -21241,7 +21253,23 @@ spec:
key: controller.kubectl.parallelism.limit
name: argocd-cmd-params-cm
optional: true
<<<<<<< HEAD
image: quay.io/argoproj/argocd:v2.8.8
=======
- name: ARGOCD_K8SCLIENT_RETRY_MAX
valueFrom:
configMapKeyRef:
key: controller.k8sclient.retry.max
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
valueFrom:
configMapKeyRef:
key: controller.k8sclient.retry.base.backoff
name: argocd-cmd-params-cm
optional: true
image: quay.io/argoproj/argocd:v2.8.7
>>>>>>> 1732b3105 (feat: add retry logic for k8s client #7692 (#16154))
imagePullPolicy: Always
name: argocd-application-controller
ports:
24 changes: 24 additions & 0 deletions manifests/ha/namespace-install.yaml
@@ -2501,6 +2501,18 @@ spec:
key: server.enable.proxy.extension
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
valueFrom:
configMapKeyRef:
key: server.k8sclient.retry.max
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
valueFrom:
configMapKeyRef:
key: server.k8sclient.retry.base.backoff
name: argocd-cmd-params-cm
optional: true
image: quay.io/argoproj/argocd:v2.8.8
imagePullPolicy: Always
livenessProbe:
@@ -2747,6 +2759,18 @@ spec:
key: controller.kubectl.parallelism.limit
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
valueFrom:
configMapKeyRef:
key: controller.k8sclient.retry.max
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
valueFrom:
configMapKeyRef:
key: controller.k8sclient.retry.base.backoff
name: argocd-cmd-params-cm
optional: true
image: quay.io/argoproj/argocd:v2.8.8
imagePullPolicy: Always
name: argocd-application-controller
32 changes: 32 additions & 0 deletions manifests/install.yaml
@@ -20050,7 +20050,23 @@ spec:
key: server.enable.proxy.extension
name: argocd-cmd-params-cm
optional: true
<<<<<<< HEAD
image: quay.io/argoproj/argocd:v2.8.8
=======
- name: ARGOCD_K8SCLIENT_RETRY_MAX
valueFrom:
configMapKeyRef:
key: server.k8sclient.retry.max
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
valueFrom:
configMapKeyRef:
key: server.k8sclient.retry.base.backoff
name: argocd-cmd-params-cm
optional: true
image: quay.io/argoproj/argocd:v2.8.7
>>>>>>> 1732b3105 (feat: add retry logic for k8s client #7692 (#16154))
imagePullPolicy: Always
livenessProbe:
httpGet:
@@ -20296,7 +20312,23 @@ spec:
key: controller.kubectl.parallelism.limit
name: argocd-cmd-params-cm
optional: true
<<<<<<< HEAD
image: quay.io/argoproj/argocd:v2.8.8
=======
- name: ARGOCD_K8SCLIENT_RETRY_MAX
valueFrom:
configMapKeyRef:
key: controller.k8sclient.retry.max
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
valueFrom:
configMapKeyRef:
key: controller.k8sclient.retry.base.backoff
name: argocd-cmd-params-cm
optional: true
image: quay.io/argoproj/argocd:v2.8.7
>>>>>>> 1732b3105 (feat: add retry logic for k8s client #7692 (#16154))
imagePullPolicy: Always
name: argocd-application-controller
ports:
24 changes: 24 additions & 0 deletions manifests/namespace-install.yaml
@@ -1556,6 +1556,18 @@ spec:
key: server.enable.proxy.extension
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
valueFrom:
configMapKeyRef:
key: server.k8sclient.retry.max
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
valueFrom:
configMapKeyRef:
key: server.k8sclient.retry.base.backoff
name: argocd-cmd-params-cm
optional: true
image: quay.io/argoproj/argocd:v2.8.8
imagePullPolicy: Always
livenessProbe:
@@ -1802,6 +1814,18 @@ spec:
key: controller.kubectl.parallelism.limit
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_MAX
valueFrom:
configMapKeyRef:
key: controller.k8sclient.retry.max
name: argocd-cmd-params-cm
optional: true
- name: ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF
valueFrom:
configMapKeyRef:
key: controller.k8sclient.retry.base.backoff
name: argocd-cmd-params-cm
optional: true
image: quay.io/argoproj/argocd:v2.8.8
imagePullPolicy: Always
name: argocd-application-controller
10 changes: 8 additions & 2 deletions pkg/apis/application/v1alpha1/types.go
@@ -35,11 +35,11 @@ import (
"k8s.io/client-go/tools/clientcmd/api"
"sigs.k8s.io/yaml"

"github.com/argoproj/argo-cd/v2/util/env"

"github.com/argoproj/argo-cd/v2/common"
"github.com/argoproj/argo-cd/v2/util/collections"
"github.com/argoproj/argo-cd/v2/util/env"
"github.com/argoproj/argo-cd/v2/util/helm"
utilhttp "github.com/argoproj/argo-cd/v2/util/http"
"github.com/argoproj/argo-cd/v2/util/security"
)

@@ -2850,6 +2850,12 @@ func SetK8SConfigDefaults(config *rest.Config) error {
config.Timeout = K8sServerSideTimeout

config.Transport = tr
maxRetries := env.ParseInt64FromEnv(utilhttp.EnvRetryMax, 0, 1, math.MaxInt64)
if maxRetries > 0 {
backoffDurationMS := env.ParseInt64FromEnv(utilhttp.EnvRetryBaseBackoff, 100, 1, math.MaxInt64)
backoffDuration := time.Duration(backoffDurationMS) * time.Millisecond
config.WrapTransport = utilhttp.WithRetry(maxRetries, backoffDuration)
}
return nil
}
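
To illustrate how a transport wrapper of the shape passed to `config.WrapTransport` can add retries, here is a self-contained Go sketch. It is not the code behind `utilhttp.WithRetry`: the names `retryTransport`, `backoff`, `withRetry` and `runDemo` are hypothetical, the 10 second cap is taken from the documentation in this commit, and the retry conditions follow the documented rules. The sketch only handles requests without a body; replaying a body would need extra handling.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// retryTransport is a hypothetical http.RoundTripper that re-sends failed requests.
type retryTransport struct {
	next       http.RoundTripper
	maxRetries int64
	base       time.Duration
	maxWait    time.Duration
}

// backoff returns min(maxWait, base * 2^attempt) without overflowing.
func backoff(base, maxWait time.Duration, attempt int64) time.Duration {
	d := base
	for i := int64(0); i < attempt && d < maxWait; i++ {
		d *= 2
	}
	if d > maxWait {
		return maxWait
	}
	return d
}

func (t *retryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	for attempt := int64(0); ; attempt++ {
		resp, err := t.next.RoundTrip(req)
		// Retry on transport errors, 429, and 5xx other than 501.
		retriable := err != nil ||
			resp.StatusCode == http.StatusTooManyRequests ||
			(resp.StatusCode >= 500 && resp.StatusCode != http.StatusNotImplemented)
		if !retriable || attempt >= t.maxRetries {
			return resp, err
		}
		if resp != nil {
			resp.Body.Close() // release the connection before retrying
		}
		select {
		case <-time.After(backoff(t.base, t.maxWait, attempt)):
		case <-req.Context().Done():
			return nil, req.Context().Err()
		}
	}
}

// withRetry returns a wrapper with the func(http.RoundTripper) http.RoundTripper
// shape that rest.Config.WrapTransport expects.
func withRetry(maxRetries int64, base time.Duration) func(http.RoundTripper) http.RoundTripper {
	return func(rt http.RoundTripper) http.RoundTripper {
		return &retryTransport{next: rt, maxRetries: maxRetries, base: base, maxWait: 10 * time.Second}
	}
}

// runDemo sends one GET to a test server that answers 503 twice and then 200.
func runDemo() (status int, calls int32, err error) {
	var n int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		if atomic.AddInt32(&n, 1) <= 2 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	client := &http.Client{Transport: withRetry(3, time.Millisecond)(http.DefaultTransport)}
	resp, err := client.Get(srv.URL)
	if err != nil {
		return 0, atomic.LoadInt32(&n), err
	}
	defer resp.Body.Close()
	return resp.StatusCode, atomic.LoadInt32(&n), nil
}

func main() {
	status, calls, err := runDemo()
	fmt.Println(status, calls, err)
}
```

Waiting in a `select` on the request context lets a cancelled or timed-out request stop retrying promptly instead of sleeping through the full backoff.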
