pageserver: improve read amp metric #10573

erikgrinaker · 2025-01-29T21:31:16Z

Problem

The current global pageserver_layers_visited_per_vectored_read_global metric does not appear to accurately measure read amplification. It divides the layer count by the number of reads in a batch, but this means that e.g. 10 reads with 100 L0 layers will only measure a read amp of 10 per read, while the actual read amp was 100.

While the cost of layer visits is amortized across the batch, and some layers may not intersect with a given key, each visited layer contributes directly to the observed latency of every read in the batch, which is what we care about.
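To make the arithmetic concrete, here is a minimal sketch comparing the two calculations for the example of 10 reads over 100 L0 layers. The function names are hypothetical and are not taken from the pageserver code; they only illustrate the difference between averaging across the batch and charging every read the batch's full layer count.

```rust
/// Old calculation (hypothetical name): layers visited, averaged across the
/// reads in the batch.
fn layers_per_read_averaged(layers_visited: u64, reads_in_batch: u64) -> f64 {
    if reads_in_batch == 0 {
        return 0.0;
    }
    layers_visited as f64 / reads_in_batch as f64
}

/// New calculation (hypothetical name): every read in the batch is charged
/// all of the layers visited for the batch, since each visited layer adds to
/// the latency that every read observes.
fn layers_per_read_counted(layers_visited: u64, reads_in_batch: u64) -> Vec<u64> {
    vec![layers_visited; reads_in_batch as usize]
}
```

With 100 layers and 10 reads, `layers_per_read_averaged(100, 10)` yields 10.0, while `layers_per_read_counted(100, 10)` yields ten observations of 100 each.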

Touches https://github.com/neondatabase/cloud/issues/23283.
Extracted from #10566.

Summary of changes

Count the number of layers visited towards each read in the batch, instead of the average across the batch.
Rename pageserver_layers_visited_per_vectored_read_global to pageserver_layers_per_read_global.
Lower the read amp log warning threshold from 512 to 100.
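The changes above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `LayersPerRead` is a hypothetical stand-in for a histogram, `record_batch` is a hypothetical helper, and whether the real threshold comparison is inclusive is an assumption.

```rust
/// Warning threshold from the summary above (previously 512).
const READ_AMP_WARN_THRESHOLD: u64 = 100;

/// Hypothetical stand-in for a histogram metric: it only collects the
/// observed values so they can be inspected.
#[derive(Default)]
struct LayersPerRead {
    observations: Vec<f64>,
}

impl LayersPerRead {
    fn observe(&mut self, value: f64) {
        self.observations.push(value);
    }
}

/// Records the batch's layer count once per read, rather than recording a
/// single average for the batch. Returns true if a read amp warning would be
/// logged. Assumption: the comparison is inclusive (>=).
fn record_batch(hist: &mut LayersPerRead, layers_visited: u64, num_reads: usize) -> bool {
    for _ in 0..num_reads {
        hist.observe(layers_visited as f64);
    }
    num_reads > 0 && layers_visited >= READ_AMP_WARN_THRESHOLD
}
```

In this sketch, a batch of 10 reads that visited 100 layers adds ten observations of 100 and crosses the warning threshold, whereas under the old threshold of 512 it would not have.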

github-actions · 2025-01-30T00:15:58Z

7414 tests run: 7063 passed, 0 failed, 351 skipped (full report)

Flaky tests (7)

Postgres 17

test_pgdata_import_smoke[None-1024-RelBlockSize.MULTIPLE_RELATION_SEGMENTS]: debug-x86-64-without-lfc, release-arm64-with-lfc, release-arm64-without-lfc
test_pgdata_import_smoke[8-1024-RelBlockSize.MULTIPLE_RELATION_SEGMENTS]: debug-x86-64-without-lfc, release-arm64-with-lfc, release-arm64-without-lfc

Postgres 14

test_timeline_archive[4]: release-x86-64-with-lfc

Code coverage* (full report)

functions: 33.4% (8508 of 25499 functions)
lines: 49.1% (71435 of 145499 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
7cfd38c at 2025-01-30T00:15:57.405Z


pageserver: improve read amp metric

7cfd38c

erikgrinaker requested review from problame and skyzh January 29, 2025 21:31

erikgrinaker requested a review from a team as a code owner January 29, 2025 21:31

This was referenced Jan 29, 2025

pageserver: add per-timeline read amp histogram #10566

Merged

pageserver: add pageserver_deltas_per_read_global metric #10570

Merged

skyzh approved these changes Jan 29, 2025


erikgrinaker added this pull request to the merge queue Jan 30, 2025

Merged via the queue into main with commit b247271 Jan 30, 2025
86 checks passed

erikgrinaker deleted the erik/layers-per-read-global branch January 30, 2025 09:35
