Create 2024-11-03-linux-load-average-myths-and-realities.md #372

Open
wants to merge 3 commits into
base: master
28 changes: 28 additions & 0 deletions blog/2024-11-03-linux-load-average-myths-and-realities.md
@@ -97,3 +97,31 @@ Another issue with load average on Linux is that, unlike most operating systems
Unfortunately, there’s not much we can do to fully eliminate artificial load average spikes when running Netdata. Lowering data collection frequency and adding significant jitter would reduce spikes, but at the cost of data accuracy, which is something we prioritize at Netdata. The load average calculation in the Linux kernel simply doesn’t provide an accurate view for high-frequency, high-concurrency workloads like ours.

For users of monitoring systems, this highlights the importance of **not relying solely on load average** as an indicator of system health. Complementary metrics, such as CPU utilization and pressure metrics, provide a more accurate and stable view of actual resource usage and contention.

## Beyond Load Average: Consider PSI for Accurate Resource Contention

For users looking for a more precise indicator of system health, **Pressure Stall Information (PSI)** offers a modern alternative to load average. Unlike load average, which is an aggregate view that can be skewed by high concurrency and short-lived tasks, PSI measures the **pressure on specific resources** (CPU, memory, and I/O) and provides insight into how often tasks are delayed due to resource contention.

PSI was introduced in Linux kernel 4.20 and is designed to help you understand **how much time tasks spend waiting for resources**. Here’s a breakdown of each PSI metric and what it tells you:

### CPU Pressure

- **`system.cpu_some_pressure`**: This metric shows the percentage of time some tasks were delayed due to insufficient CPU resources. It indicates partial CPU contention, where some tasks experience delays but not the entire system.
- **`system.cpu_some_pressure_stall_time`**: This metric shows the amount of time some tasks were delayed due to insufficient CPU resources.
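
To make the raw data behind these charts concrete: the kernel exposes PSI as small text files under `/proc/pressure/` (`cpu`, `memory`, `io`), each with a `some` line and, where available, a `full` line. The sketch below parses that format. The function name `parse_psi` and the sample values are hypothetical, and it is only an assumption that Netdata's charts are derived from these same fields.

```python
def parse_psi(text):
    """Parse the contents of a PSI file into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, value = field.partition("=")
            # avg10/avg60/avg300 are percentages; total is cumulative microseconds
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


# Sample content in the format of /proc/pressure/cpu (values are made up)
sample = """some avg10=1.25 avg60=0.80 avg300=0.30 total=123456
full avg10=0.00 avg60=0.00 avg300=0.00 total=0"""

psi = parse_psi(sample)
print(psi["some"]["avg10"])   # percentage of time over the last 10 seconds
print(psi["some"]["total"])   # cumulative stall time in microseconds
```

On a live system, the same function could be fed the contents of `/proc/pressure/cpu`, provided the kernel was built with PSI support and has it enabled.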

For containers, Netdata provides:

- **`cgroup.cpu_some_pressure`**: The percentage of time some container tasks were delayed due to insufficient CPU resources.
- **`cgroup.cpu_some_pressure_stall_time`**: The amount of time some container tasks were delayed due to insufficient CPU resources.
- **`cgroup.cpu_full_pressure`**: The percentage of time all non-idle container tasks were delayed due to insufficient CPU resources.
- **`cgroup.cpu_full_pressure_stall_time`**: The amount of time all non-idle container tasks were delayed due to insufficient CPU resources.
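
The "percentage" and "stall time" variants above describe the same thing from two angles. In the kernel's PSI interface (for cgroup v2, the `cpu.pressure` file in the cgroup's directory), `total` is a cumulative stall time in microseconds, so the difference between two readings divided by the elapsed time gives the share of that interval spent stalled. Here is a minimal sketch of that arithmetic. The function name and sample numbers are hypothetical, and this is not necessarily how Netdata computes its charts.

```python
def stall_percent(total_before_us, total_after_us, interval_seconds):
    """Share of the interval spent stalled, from two cumulative totals in microseconds."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    stalled_us = total_after_us - total_before_us
    return 100.0 * stalled_us / (interval_seconds * 1_000_000)


# Two hypothetical readings of the "some" total, taken 10 seconds apart:
# 500,000 microseconds (0.5 s) of stall time accumulated in between.
print(stall_percent(2_000_000, 2_500_000, 10))  # 5.0
```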

### Memory and I/O Pressure

Similarly, Netdata provides pressure metrics for memory and I/O.

### Why PSI is Better Than Load Average for Monitoring Contention

Unlike load average, which is an indirect measure that can be affected by task scheduling quirks and asynchronous load calculations, **PSI directly measures contention on critical resources**. PSI allows you to pinpoint whether the system is facing real pressure on CPU, memory, or I/O resources.

For example, if you see high `system.cpu_some_pressure` values, you know that some tasks are facing CPU contention. By contrast, load average can be misleading in these situations, often suggesting extreme load spikes that don’t align with actual resource contention.
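
One way to put this into practice is to check load average against CPU pressure before reacting to a spike. The sketch below flags the case where load looks high but PSI shows little contention. The function name and both thresholds (`load_per_cpu_limit`, `pressure_limit`) are illustrative assumptions, not recommended values.

```python
def load_spike_without_pressure(load1, cpu_count, cpu_some_avg10,
                                load_per_cpu_limit=1.0, pressure_limit=5.0):
    """Return True when load average is high but CPU pressure is low.

    load1           -- 1-minute load average
    cpu_count       -- number of CPUs on the system
    cpu_some_avg10  -- "some" CPU pressure over the last 10 seconds, in percent
    Both limits are hypothetical thresholds chosen for illustration.
    """
    high_load = load1 / cpu_count > load_per_cpu_limit
    low_pressure = cpu_some_avg10 < pressure_limit
    return high_load and low_pressure


# Load of 24 on 8 CPUs with almost no CPU pressure: likely not real contention.
print(load_spike_without_pressure(load1=24.0, cpu_count=8, cpu_some_avg10=0.4))   # True
# Same load with tasks stalled 35% of the time: contention is real.
print(load_spike_without_pressure(load1=24.0, cpu_count=8, cpu_some_avg10=35.0))  # False
```

On a live system, `load1` could come from `os.getloadavg()[0]` and `cpu_count` from `os.cpu_count()`, with `cpu_some_avg10` read from the pressure data.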