Move eviction-policy from tasks to concepts
add what's next to eviction policy
gm7y8 authored and sftim committed Sep 7, 2020
1 parent f0a32c7 commit b330bb0
Showing 5 changed files with 74 additions and 78 deletions.
23 changes: 23 additions & 0 deletions content/en/docs/concepts/scheduling-eviction/eviction-policy.md
@@ -0,0 +1,23 @@
---
title: Eviction Policy
content_template: templates/concept
weight: 60
---

<!-- overview -->

This page provides an overview of Kubernetes' eviction policy.

<!-- body -->

## Eviction Policy

The {{< glossary_tooltip text="Kubelet" term_id="kubelet" >}} can proactively monitor for and prevent total starvation of a
compute resource. In those cases, the `kubelet` can reclaim the starved
resource by proactively failing one or more Pods. When the `kubelet` fails
a Pod, it terminates all of its containers and transitions its `PodPhase` to `Failed`.
If the evicted Pod is managed by a Deployment, the Deployment will create another Pod
to be scheduled by Kubernetes.

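For instance, Pods that the kubelet has failed in this way remain visible with a `Failed` phase. One way to list them (a sketch; the exact output depends on your cluster) is:

```bash
# List Pods in the Failed phase across all namespaces; Pods failed by the
# kubelet due to eviction show a reason of "Evicted" in their status.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
```
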
## {{% heading "whatsnext" %}}
- Read [Configure out of resource handling](/docs/tasks/administer-cluster/out-of-resource/) to learn more about eviction signals, thresholds, and handling.
11 changes: 4 additions & 7 deletions content/en/docs/concepts/scheduling-eviction/kube-scheduler.md
@@ -33,7 +33,7 @@ kube-scheduler is designed so that, if you want and need to, you can
write your own scheduling component and use that instead.

For every newly created pod or other unscheduled pods, kube-scheduler
selects an optimal node for them to run on. However, every container in
pods has different requirements for resources and every pod also has
different requirements. Therefore, existing nodes need to be filtered
according to the specific scheduling requirements.
@@ -77,12 +77,9 @@ one of these at random.
There are two supported ways to configure the filtering and scoring behavior
of the scheduler:

1. [Scheduling Policies](/docs/reference/scheduling/policies) allow you to configure _Predicates_ for filtering and _Priorities_ for scoring.
1. [Scheduling Profiles](/docs/reference/scheduling/profiles) allow you to configure Plugins that implement different scheduling stages, including: `QueueSort`, `Filter`, `Score`, `Bind`, `Reserve`, `Permit`, and others. You can also configure the kube-scheduler to run different profiles.


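As an illustration of the second approach, a minimal two-profile configuration might look like the sketch below. The file path and the `kubescheduler.config.k8s.io/v1beta1` API version are assumptions; adjust both to match your Kubernetes release. A Pod opts into a profile by setting `spec.schedulerName`.

```bash
# Write a KubeSchedulerConfiguration with two profiles: the default profile and
# a second profile that disables all scoring plugins, then start kube-scheduler
# with it. Path, profile names, and API version are illustrative.
cat <<'EOF' > /etc/kubernetes/scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
  - schedulerName: no-scoring-scheduler
    plugins:
      score:
        disabled:
          - name: '*'
EOF

kube-scheduler --config=/etc/kubernetes/scheduler-config.yaml
```
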
## {{% heading "whatsnext" %}}
@@ -3,7 +3,7 @@ reviewers:
- bsalamat
title: Scheduler Performance Tuning
content_type: concept
weight: 70
weight: 80
---

<!-- overview -->
@@ -48,10 +48,13 @@ To change the value, edit the kube-scheduler configuration file (this is likely
to be `/etc/kubernetes/config/kube-scheduler.yaml`), then restart the scheduler.

After you have made this change, you can run

```bash
kubectl get componentstatuses
```

to verify that the kube-scheduler component is healthy. The output is similar to:

```
NAME STATUS MESSAGE ERROR
controller-manager Healthy ok
```
@@ -3,7 +3,7 @@ reviewers:
- ahg-g
title: Scheduling Framework
content_type: concept
weight: 60
weight: 70
---

<!-- overview -->
111 changes: 42 additions & 69 deletions content/en/docs/tasks/administer-cluster/out-of-resource.md
@@ -18,28 +18,19 @@ nodes become unstable.

<!-- body -->

### Eviction Signals

The `kubelet` supports eviction decisions based on the signals described in the following
table. The value of each signal is described in the Description column, which is based on
the `kubelet` summary API.

| Eviction Signal | Description |
|----------------------|---------------------------------------------------------------------------------------|
| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` |
| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` |
| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` |
| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |

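To see the node-level statistics these signals are derived from, you can query the kubelet summary API through the API server proxy. This is only a sketch: `<node-name>` is a placeholder and `jq` is assumed to be installed.

```bash
# Fetch the kubelet stats summary for one node and pull out the fields that
# feed memory.available, nodefs.available, and nodefs.inodesFree.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '{
  memoryAvailableBytes: .node.memory.availableBytes,
  nodefsAvailableBytes: .node.fs.availableBytes,
  nodefsInodesFree:     .node.fs.inodesFree
}'
```
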
Each of the above signals supports either a literal or percentage-based value.
The percentage-based value is calculated relative to the total capacity
@@ -65,7 +56,7 @@ memory is reclaimable under pressure.
`imagefs` is optional. `kubelet` auto-discovers these filesystems using
cAdvisor. `kubelet` does not care about any other filesystems. Any other types
of configurations are not currently supported by the kubelet. For example, it is
_not OK_ to store volumes and logs in a dedicated `filesystem`.

In future releases, the `kubelet` will deprecate the existing [garbage
collection](/docs/concepts/cluster-administration/kubelet-garbage-collection/)
@@ -83,9 +74,7 @@ where:

* `eviction-signal` is an eviction signal token as defined in the previous table.
* `operator` is the desired relational operator, such as `<` (less than).
* `quantity` is the eviction threshold quantity, such as `1Gi`. These tokens must match the quantity representation used by Kubernetes. An eviction threshold can also be expressed as a percentage using the `%` token.

For example, if a node has `10Gi` of total memory and you want to trigger eviction if
the available memory falls below `1Gi`, you can define the eviction threshold as
@@ -108,12 +97,9 @@ termination.

To configure soft eviction thresholds, the following flags are supported:

* `eviction-soft` describes a set of eviction thresholds (e.g. `memory.available<1.5Gi`) that if met over a corresponding grace period would trigger a Pod eviction.
* `eviction-soft-grace-period` describes a set of eviction grace periods (e.g. `memory.available=1m30s`) that correspond to how long a soft eviction threshold must hold before triggering a Pod eviction.
* `eviction-max-pod-grace-period` describes the maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.

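Putting these flags together, a kubelet invocation with soft eviction configured might look like the following sketch; the values are illustrative, not recommendations.

```bash
# Soft eviction: evict only if memory.available stays below 1.5Gi for 1m30s,
# and cap the grace period used when terminating evicted Pods at 60 seconds.
# (Other required kubelet flags are omitted from this sketch.)
kubelet \
  --eviction-soft=memory.available<1.5Gi \
  --eviction-soft-grace-period=memory.available=1m30s \
  --eviction-max-pod-grace-period=60
```
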
#### Hard Eviction Thresholds

@@ -124,8 +110,7 @@ with no graceful termination.

To configure hard eviction thresholds, the following flag is supported:

* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that if met would trigger a Pod eviction.

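For example, a hard threshold on available memory can be configured as in the sketch below; the value is illustrative.

```bash
# Hard eviction: evict Pods immediately, with no grace period, once
# memory.available drops below 1Gi. (Other kubelet flags are omitted.)
kubelet --eviction-hard=memory.available<1Gi
```
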
The `kubelet` has the following default hard eviction threshold:

@@ -150,10 +135,10 @@ reflects the node is under pressure.

The following node conditions are defined; each corresponds to one or more of the eviction signals shown.

| Node Condition | Eviction Signal | Description |
|-------------------|---------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |

The `kubelet` continues to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.
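
You can check whether a node is currently reporting either condition with, for example:

```bash
# Show the MemoryPressure and DiskPressure conditions reported for a node
# (<node-name> is a placeholder).
kubectl describe node <node-name> | grep -E 'MemoryPressure|DiskPressure'
```
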
@@ -168,8 +153,7 @@ as a consequence.
To protect against this oscillation, the following flag is defined to control how
long the `kubelet` must wait before transitioning out of a pressure condition.

* `eviction-pressure-transition-period` is the duration for which the `kubelet` has to wait before transitioning out of an eviction pressure condition.

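For example (the value shown is illustrative):

```bash
# Require the pressure condition to stay clear for 5 minutes before the kubelet
# reports the node as no longer under pressure. (Other kubelet flags are omitted.)
kubelet --eviction-pressure-transition-period=5m
```
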
The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
@@ -207,17 +191,8 @@ then by [Priority](/docs/concepts/configuration/pod-priority-preemption/), and t

As a result, `kubelet` ranks and evicts Pods in the following order:

* `BestEffort` or `Burstable` Pods whose usage of a starved resource exceeds its request. Such pods are ranked by Priority, and then usage above request.
* `Guaranteed` pods and `Burstable` pods whose usage is beneath requests are evicted last. `Guaranteed` Pods are guaranteed only when requests and limits are specified for all the containers and they are equal. Such pods are guaranteed to never be evicted because of another Pod's resource consumption. If a system daemon (such as `kubelet`, `docker`, and `journald`) is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations, and the node only has `Guaranteed` or `Burstable` Pods using less than requests remaining, then the node must choose to evict such a Pod in order to preserve node stability and to limit the impact of the unexpected consumption to other Pods. In this case, it will choose to evict pods of Lowest Priority first.

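The QoS class used in this ranking is recorded on each Pod; you can inspect it with, for example:

```bash
# Print the QoS class (Guaranteed, Burstable, or BestEffort) of a Pod
# (<pod-name> is a placeholder).
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}{"\n"}'
```
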
If necessary, `kubelet` evicts Pods one at a time to reclaim disk when `DiskPressure`
is encountered. If the `kubelet` is responding to `inode` starvation, it reclaims
@@ -228,20 +203,21 @@ that consumes the largest amount of disk and kills those first.
#### With `imagefs`

If `nodefs` is triggering evictions, `kubelet` sorts Pods based on the usage on `nodefs`

- local volumes + logs of all its containers.

If `imagefs` is triggering evictions, `kubelet` sorts Pods based on the writable layer usage of all its containers.

#### Without `imagefs`

If `nodefs` is triggering evictions, `kubelet` sorts Pods based on their total disk usage

- local volumes + logs & writable layer of all its containers.

### Minimum eviction reclaim

In certain scenarios, eviction of Pods could result in reclamation of only a small amount of resources. This can result in
`kubelet` hitting eviction thresholds in repeated succession. In addition to that, eviction of resources like `disk` is time consuming.

To mitigate these issues, `kubelet` can have a per-resource `minimum-reclaim`. Whenever `kubelet` observes
resource pressure, `kubelet` attempts to reclaim at least `minimum-reclaim` amount of resource below
@@ -268,10 +244,10 @@ The node reports a condition when a compute resource is under pressure. The
scheduler views that condition as a signal to dissuade placing additional
pods on the node.

| Node Condition | Scheduler Behavior |
| ------------------| ----------------------------------------------------|
| `MemoryPressure` | No new `BestEffort` Pods are scheduled to the node. |
| `DiskPressure` | No new Pods are scheduled to the node. |

## Node OOM Behavior

@@ -280,11 +256,11 @@ the node depends on the [oom_killer](https://lwn.net/Articles/391222/) to respon

The `kubelet` sets an `oom_score_adj` value for each container based on the quality of service for the Pod.

| Quality of Service | oom_score_adj |
|--------------------|-----------------------------------------------------------------------------------|
| `Guaranteed` | -998 |
| `BestEffort` | 1000 |
| `Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |

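As a worked example of the `Burstable` formula above, with hypothetical numbers:

```bash
# A Burstable container requesting 1Gi of memory on a node with 4Gi capacity:
# oom_score_adj = min(max(2, 1000 - (1000 * 1Gi) / 4Gi), 999) = 750
memoryRequestBytes=$((1 * 1024 * 1024 * 1024))
machineMemoryCapacityBytes=$((4 * 1024 * 1024 * 1024))
score=$(( 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes ))
(( score < 2 ))   && score=2     # clamp to the [2, 999] range from the formula
(( score > 999 )) && score=999
echo "oom_score_adj=${score}"    # prints oom_score_adj=750
```
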
If the `kubelet` is unable to reclaim memory prior to a node experiencing system OOM, the `oom_killer` calculates
an `oom_score` based on the percentage of memory it's using on the node, and then adds the `oom_score_adj` to get an
@@ -325,10 +301,7 @@ and trigger eviction assuming those Pods use less than their configured request.

### DaemonSet

As `Priority` is a key factor in the eviction strategy, if you do not want pods belonging to a `DaemonSet` to be evicted, specify a sufficiently high priorityClass in the pod spec template. If you want pods belonging to a `DaemonSet` to run only if there are sufficient resources, specify a lower or default priorityClass.
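
For example, you could create a high PriorityClass for such DaemonSet Pods and reference it via `priorityClassName` in the DaemonSet's Pod template; the class name and value below are illustrative.

```bash
# Create a PriorityClass so that DaemonSet Pods referencing it are evicted last.
kubectl create priorityclass daemonset-high-priority \
  --value=1000000 \
  --description="Keep DaemonSet pods from being evicted before other workloads"
```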


## Deprecation of existing feature flags to reclaim disk
Expand All @@ -338,15 +311,15 @@ there are sufficient resources, specify a lower or default priorityClass.
As disk-based eviction matures, the following `kubelet` flags are marked for deprecation
in favor of the simpler configuration supported around eviction.

| Existing Flag | New Flag |
| ------------------------------------------ | ----------------------------------------|
| `--image-gc-high-threshold` | `--eviction-hard` or `eviction-soft` |
| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` |
| `--maximum-dead-containers` | deprecated |
| `--maximum-dead-containers-per-container` | deprecated |
| `--minimum-container-ttl-duration` | deprecated |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `eviction-soft` |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` |

## Known issues

