Proxy release 2024-11-07 #9674

Merged: 65 commits merged into release-proxy from rc/proxy/2024-11-07 on Nov 7, 2024
Conversation


@vipvap commented Nov 7, 2024

Proxy release 2024-11-07

Please merge this Pull Request using the 'Create a merge commit' button

conradludgate and others added 30 commits October 29, 2024 11:39
Python-based regression test setup for auth_broker. This uses an HTTP mock for cplane as well as for the JWKS URL.

Complications:
1. We cannot just use the local_proxy binary, as it requires the pg_session_jwt extension, which is not available in the current test suite.
2. We cannot use just any HTTP mock for local_proxy, as auth_broker requires HTTP/2 to local_proxy.

As such, I used the h2 library to implement an echo server, copied from the examples in the h2 docs.
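As a sketch of what such an HTTP/2 echo server looks like (shown here with the Rust `h2` crate's documented server API; the actual test fixture may differ in language and details):

```rust
use h2::server;
use http::Response;
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative port; the test fixture picks its own.
    let listener = TcpListener::bind("127.0.0.1:7432").await?;
    loop {
        let (socket, _peer) = listener.accept().await?;
        tokio::spawn(async move {
            // HTTP/2 handshake, then serve each request stream on the connection.
            let mut conn = server::handshake(socket).await.expect("h2 handshake");
            while let Some(request) = conn.accept().await {
                let (request, mut respond) = request.expect("h2 stream");
                let mut body = request.into_body();
                // Send response headers, then echo the request body back.
                let mut send = respond
                    .send_response(Response::new(()), false)
                    .expect("send response");
                while let Some(chunk) = body.data().await {
                    let chunk = chunk.expect("recv data");
                    // Release flow-control capacity so the peer can keep sending.
                    let _ = body.flow_control().release_capacity(chunk.len());
                    send.send_data(chunk, false).expect("send data");
                }
                // End the stream once the request body is exhausted.
                send.send_data(bytes::Bytes::new(), true).expect("end stream");
            }
        });
    }
}
```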
## Problem

Indices used to be the only kind of object where we had to search across
generations to find the most recent one. As of
#9543, manifests will need
the same treatment.

## Summary of changes

- Refactor `download_index_part` into a generic `download_generation_object` function, which will be usable for downloading manifest objects as well.
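For intuition, the "most recent generation wins" rule looks roughly like the sketch below; this is an illustrative helper over a remote listing, not the actual pageserver code, which also has to deal with listing failures and more object kinds:

```rust
// Illustrative sketch only: given a listing of objects sharing a base name,
// pick the one with the highest generation suffix. An un-suffixed legacy name
// ranks below any "-XXXXXXXX" hex generation suffix.
fn pick_latest_generation<'a>(base: &str, listing: &'a [String]) -> Option<&'a String> {
    listing
        .iter()
        .filter_map(|key| {
            let rest = key.strip_prefix(base)?;
            let generation = match rest.strip_prefix('-') {
                Some(hex) => i64::from_str_radix(hex, 16).ok()?,
                None if rest.is_empty() => -1, // legacy object without a generation
                None => return None,           // unrelated key sharing the prefix
            };
            Some((generation, key))
        })
        .max_by_key(|(generation, _)| *generation)
        .map(|(_, key)| key)
}
```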
…nifest uploads (#9557)

## Problem

Uploads of the tenant manifest could race between different tasks,
resulting in unexpected results in remote storage.

Closes: #9556

## Summary of changes

- Create a central function for uploads that takes a `tokio::sync::Mutex`
- Store the latest upload in that Mutex, so that when there is a lot of concurrency (e.g. archiving 20 timelines at once) we can coalesce their manifest writes somewhat.
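A hedged sketch of the coalescing idea (the struct, manifest type, and upload call are illustrative stand-ins, not the tenant code):

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Illustrative stand-ins; the real types live in the pageserver crate.
#[derive(PartialEq)]
struct TenantManifest {
    offloaded_timelines: Vec<String>,
}

async fn upload_to_remote_storage(_manifest: &TenantManifest) -> anyhow::Result<()> {
    Ok(()) // stand-in for the actual remote-storage upload
}

struct ManifestUploader {
    // Serializes uploads and remembers the last manifest we wrote.
    last_uploaded: Arc<Mutex<Option<TenantManifest>>>,
}

impl ManifestUploader {
    async fn maybe_upload(&self, manifest: TenantManifest) -> anyhow::Result<()> {
        // Holding the async lock across the upload prevents racing writers;
        // a task that lost the race sees the fresh manifest and can skip.
        let mut guard = self.last_uploaded.lock().await;
        if guard.as_ref() == Some(&manifest) {
            return Ok(());
        }
        upload_to_remote_storage(&manifest).await?;
        *guard = Some(manifest);
        Ok(())
    }
}
```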
## Problem

See #9458
This PR separates PS related changes in #9458 from compute_ctl changes
to enforce that PS is deployed before compute.

## Summary of changes

This PR adds handling of the `--replica` parameter of basebackup to the pageserver.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <[email protected]>
…#9134)

part of #8921,
#9114

## Summary of changes

We start the partial compaction implementation with partial image layer generation. The partial compaction API now takes a key range; for now, we only generate images for that key range, and we remove layers fully included in the key range after compaction.
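For illustration, the layer-removal rule described above amounts to the following check (hypothetical `Key`/`Layer` stand-ins, not the pageserver's types):

```rust
use std::ops::Range;

// Illustrative stand-ins for the pageserver's key and layer types.
type Key = u64;

struct Layer {
    key_range: Range<Key>,
}

/// After a partial image-layer compaction over `compact_range`, only layers
/// whose key range is fully covered by that range may be removed.
fn removable_after_partial_compaction<'a>(
    layers: &'a [Layer],
    compact_range: &Range<Key>,
) -> Vec<&'a Layer> {
    layers
        .iter()
        .filter(|l| {
            compact_range.start <= l.key_range.start && l.key_range.end <= compact_range.end
        })
        .collect()
}
```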

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>
This patch contains various improvements for the pagectl tool.

## Summary of changes

* Rewrite layer name parsing: LayerName now supports all variants we currently use.
* Drop pagectl's own layer parsing function and use `LayerName` from the pageserver crate.
* Support image layer dumping in the layer dump command using `ImageLayer::dump`; drop the original implementation.

Signed-off-by: Alex Chi Z <[email protected]>
Should help us keep non-working metrics from hitting staging or
production.

Co-authored-by: Heikki Linnakangas <[email protected]>
Fixes: #8569
Signed-off-by: Tristan Partin <[email protected]>
## Problem

When tenant manifest objects are written without a generation suffix, concurrently attached pageservers may stamp on each other's writes of the manifest and cause undefined behavior.

Closes: #9543 

## Summary of changes

- Use the `download_generation_object` helper when reading manifests, to search for the most recent generation
- Use `Tenant::generation` as the generation suffix when writing manifests.
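For intuition, the written object name simply carries the attachment generation as a hex suffix; a sketch with a hypothetical key format (the real layout comes from the pageserver's remote path helpers):

```rust
// Illustrative only: generation-suffixed manifest object name,
// e.g. "tenants/<tenant_id>/tenant_manifest.json-0000002a".
fn manifest_key(tenant_id: &str, generation: u32) -> String {
    format!("tenants/{tenant_id}/tenant_manifest.json-{generation:08x}")
}
```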
In July of 2023, Bojan and Chi authored
92aee7e. Our in-production pageservers are most definitely at a version where they all support gzipped basebackups.
## Problem

There were some critical breaking changes made upstream since the morning of Oct 29th.

## Summary of changes

Point it to the topmost commit in the `neon` branch at the time of
writing this
https://github.com/Mooncake-Labs/pg_mooncake/commits/neon/
Mooncake-Labs/pg_mooncake@c495cd1
## Problem

We don't have a convenient way to generate WAL records for benchmarks
and tests.

## Summary of changes

Adds a WAL generator, exposed as an iterator. It currently only
generates logical messages (noops), but will be extended to write actual
table rows later.

Some existing code for WAL generation has been replaced with this
generator, to reduce duplication.
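A minimal sketch of the iterator shape (illustrative only; the real generator emits properly encoded logical-message WAL records rather than placeholder bytes):

```rust
/// Sketch of a WAL record generator exposed as an iterator.
struct WalGenerator {
    lsn: u64,           // stand-in for an Lsn type
    record_size: usize,
}

impl Iterator for WalGenerator {
    type Item = (u64, Vec<u8>); // (start LSN, record bytes)

    fn next(&mut self) -> Option<Self::Item> {
        let record = vec![0u8; self.record_size]; // placeholder for a logical message (noop)
        let lsn = self.lsn;
        self.lsn += record.len() as u64; // advance the simulated insert position
        Some((lsn, record))
    }
}

// Usage: take as many records as a benchmark or test needs.
// let records: Vec<_> = WalGenerator { lsn: 0, record_size: 64 }.take(1000).collect();
```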
This will tell us how much time the compute has spent throttled if
pageserver/safekeeper cannot keep up with WAL generation.

Signed-off-by: Tristan Partin <[email protected]>
Disallow a request for timeline ancestor detach if either the to-be-detached timeline or any of the to-be-reparented timelines are offloaded or archived.

In theory we could support timelines that are archived but not
offloaded, but archived timelines are at the risk of being offloaded, so
we treat them like offloaded timelines. As for offloaded timelines, any
code to "support" them would amount to unoffloading them, at which point
we can just demand to have the timelines be unarchived.

Part of #8088
Constructing a remote client is no big deal. Yes, it means an extra
download from S3 but it's not that expensive. This simplifies code paths
and scenarios to test. This unifies timelines that have been recently
offloaded with timelines that have been offloaded in an earlier
invocation of the process.

Part of #8088
… cache (#9470)

In #9032, I would like to
eventually add a `generation` field to the consumption metrics cache.
The current encoding is not backward compatible and it is hard to add
another field into the cache. Therefore, this patch refactors the format
to store "field -> value", and it's easier to maintain backward/forward
compatibility with this new format.

## Summary of changes

* Add `NewRawMetric` as the new format.
* Add an upgrade path. When opening the disk cache, the codepath first inspects the `version` field and decides how to decode.
* Refactor metrics generation code and tests.
* Add tests on upgrade / compatibility with the old format.
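A hedged sketch of the version-dispatch idea when reading the disk cache (serde-based, with illustrative field names; the real format lives in the pageserver's consumption metrics code):

```rust
use serde::Deserialize;

// Illustrative shape of the new per-metric record.
#[derive(Deserialize)]
struct NewRawMetric {
    key: String,
    value: u64,
    #[serde(default)]
    generation: Option<u32>, // example of a field added later with a default
}

#[derive(Deserialize)]
struct Header {
    #[serde(default)]
    version: u32, // missing in the old format, so it defaults to 0
}

fn decode_cache(bytes: &[u8]) -> anyhow::Result<Vec<NewRawMetric>> {
    // Inspect the version first, then decide how to decode the payload.
    let header: Header = serde_json::from_slice(bytes)?;
    if header.version >= 2 {
        #[derive(Deserialize)]
        struct V2 {
            metrics: Vec<NewRawMetric>,
        }
        Ok(serde_json::from_slice::<V2>(bytes)?.metrics)
    } else {
        // Upgrade path: the old cache stored bare (key, value) pairs.
        #[derive(Deserialize)]
        struct V1 {
            metrics: Vec<(String, u64)>,
        }
        Ok(serde_json::from_slice::<V1>(bytes)?
            .metrics
            .into_iter()
            .map(|(key, value)| NewRawMetric { key, value, generation: None })
            .collect())
    }
}
```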

---------

Signed-off-by: Alex Chi Z <[email protected]>
If we delete a timeline that has children, those children will have their data corrupted. Therefore, extend the already existing safety check to offloaded timelines as well.

Part of #8088
…9524)

## Problem

Decoding and ingestion are still coupled in `pageserver::WalIngest`.

## Summary of changes

A new type is added to `wal_decoder::models`: `InterpretedWalRecord`. This type contains everything that the pageserver requires in order to ingest a WAL record. The highlights are `metadata_record`, an optional special record type to be handled, and `blocks`, which stores key-value pairs to be persisted to storage.

This type is produced by `wal_decoder::models::InterpretedWalRecord::from_bytes` from a raw PG WAL record.

The rest of this commit separates decoding and interpretation of the PG
WAL record from its application in `WalIngest::ingest_record`.

Related: #9335
Epic: #9329
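An approximate sketch of the shape of the new type (field names are illustrative; see `wal_decoder::models` for the real definition):

```rust
// Stand-in types for the sketch; the real ones live in the pageserver
// and wal_decoder crates.
struct MetadataRecord;
struct Key;
struct Value;
struct Lsn(u64);

/// Everything the pageserver needs to ingest one decoded WAL record.
/// Produced by `InterpretedWalRecord::from_bytes` from a raw PG WAL record.
struct InterpretedWalRecord {
    /// Optional special record (e.g. a commit or relation truncation)
    /// that needs dedicated handling during ingestion.
    metadata_record: Option<MetadataRecord>,
    /// Key/value pairs to be persisted to storage for this record.
    blocks: Vec<(Key, Value)>,
    /// LSN to advance the ingest position to after this record.
    next_record_lsn: Lsn,
}
```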
`DeletionQueue::new()` always returns deletion workers, so the returned
`Option` is redundant.
- pg_jsonschema 0.3.3
- pg_graphql 1.5.9
- rum 65e0a752
- pg_tiktoken a5bc447e

Update supported extension versions for v14-v16:
- pg_jsonschema 0.3.1 -> 0.3.3
- pg_graphql 1.5.7 -> 1.5.9
- rum 6ab37053 -> 65e0a752
- pg_tiktoken e64e55aa -> a5bc447e
## Problem

A clickbench regression causes clickbench to run for more than 9 hours, and the AWS session token expires before the run completes.

## Summary of changes

Extend the lifetime of the session token for this job to 12 hours.
## Problem

The final part of #9543 will
be a chaos test that creates/deletes/archives/offloads timelines while
restarting pageservers and migrating tenants. Developing that test turned up a few places where we log errors during normal shutdown.

## Summary of changes

- UninitializedTimeline's drop should log at info severity: this is a
normal code path when some part of timeline creation encounters a
cancellation `?` path.
- When offloading and finding a `RemoteTimelineClient` in a
non-initialized state, this is not an error and should not be logged as
such.
- The `offload_timeline` function returned an anyhow error, so callers couldn't gracefully pick out cancellation errors from real errors: update this to have a structured error type and use it throughout (see the sketch after this list).
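A minimal sketch of that structured error shape, assuming a `thiserror`-style enum (variant names illustrative):

```rust
use thiserror::Error;

// Illustrative: a structured error lets callers match on Cancelled instead of
// string-inspecting an anyhow error during shutdown.
#[derive(Debug, Error)]
enum OffloadError {
    #[error("cancelled")]
    Cancelled,
    #[error(transparent)]
    Other(#[from] anyhow::Error),
}

fn log_offload_result(result: Result<(), OffloadError>) {
    match result {
        Ok(()) => {}
        // Graceful shutdown is not an error worth logging loudly.
        Err(OffloadError::Cancelled) => tracing::info!("timeline offload cancelled"),
        Err(e) => tracing::error!("timeline offload failed: {:#}", e),
    }
}
```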
Our current metric for HTTP pool opened connections is always negative :D oops
Updates the extension to use pgrx 0.12. No changes to the extensions have been made; the only difference is the pgrx version.
## Problem

We wish for the deployment orchestrator to use infra scoped tokens, but the storcon endpoints it uses require admin scoped tokens.

## Summary of Changes

Switch over all endpoints that are used by the deployment orchestrator
to use an infra scoped token. This causes no breakage during mixed
version scenarios because admin scoped tokens allow access to all
endpoints. The deployment orchestrator can cut over to the infra token
after this commit touches down in prod.

Once this commit is released we should also update the test code to use infra scoped tokens where appropriate. Currently it would fail on the [compat tests](https://github.com/neondatabase/neon/blob/9761b6a64e80a4e8bce4b00afce5c2c4f6b825bd/test_runner/regress/test_storage_controller.py#L69-L71).
## Problem

`tenant_get_shards()` does not work with a sharded tenant on 1
pageserver, as it assumes an unsharded tenant in this case. This special
case appears to have been added to handle e.g. `test_emergency_mode`,
where the storage controller is stopped. This breaks e.g. the sharded
ingest benchmark in #9591 when run with a single shard.

## Summary of changes

Correctly look up shards even with a single pageserver, but add a special case that assumes an unsharded tenant if the storage controller is stopped and the caller provides an explicit pageserver, in order to accommodate `test_emergency_mode`.
## Problem

The IAM role associated with our github action runner supports a max
token expiration which is lower than the value we tried.

## Summary of changes

Since we believe we have understood the performance regression (and addressed it by ensuring availability zone affinity of compute and pageserver), the job should again run in under 5 hours, so we revert this change instead of increasing the max session token expiration in the IAM role, which would reduce our security.
cloneable and others added 11 commits November 6, 2024 10:27
## Problem

#9524 split the decoding and
interpretation step from ingestion.
The output of the first phase is a `wal_decoder::models::InterpretedWalRecord`. 
Before this patch set that struct contained a list of `Value` instances.

We wish to lift the decoding and interpretation step to the safekeeper,
but it would be nice if the safekeeper gave us a batch containing the raw data instead of actual values.

## Summary of changes

The main goal here is to make `InterpretedWalRecord` hold a raw buffer which contains pre-serialized Values. For this we do:
1. Add a `SerializedValueBatch` type. This is `inmemory_layer::SerializedBatch` with some extra functionality for extension, observing values for shard 0, and tests (a rough sketch follows below).
2. Replace `inmemory_layer::SerializedBatch` with `SerializedValueBatch`.
3. Make `DatadirModification` maintain a `SerializedValueBatch`.
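A rough sketch of the batch added in (1) (field names and types are approximate, not a copy of the real struct):

```rust
// Approximate idea: values are kept pre-serialized in one raw buffer, with
// per-value metadata describing where each value lives inside it.
struct SerializedValueBatch {
    /// Pre-serialized values, ready to be written out.
    raw: Vec<u8>,
    /// One entry per value: key, LSN and byte range inside `raw`.
    metadata: Vec<ValueMeta>,
    /// Number of values in the batch.
    len: usize,
}

struct ValueMeta {
    key: [u8; 18],     // stand-in for the compact key encoding
    lsn: u64,          // stand-in for Lsn
    batch_offset: u64, // start of the serialized value inside `raw`
    len: usize,        // serialized length of the value
}

impl SerializedValueBatch {
    /// Extend this batch with another, fixing up offsets so the metadata
    /// keeps pointing at the right place in the combined buffer.
    fn extend(&mut self, other: SerializedValueBatch) {
        let base = self.raw.len() as u64;
        self.raw.extend_from_slice(&other.raw);
        self.metadata.extend(other.metadata.into_iter().map(|mut m| {
            m.batch_offset += base;
            m
        }));
        self.len += other.len;
    }
}
```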


### `DatadirModification` changes

`DatadirModification` now maintains a `SerializedValueBatch` and extends
it as new WAL records come in (to avoid flushing to disk on every
record).
In turn, this cascaded into a number of modifications to
`DatadirModification`:
1. Replace `pending_data_pages` and `pending_zero_data_pages` with `pending_data_batch`.
2. Removal of `pending_zero_data_pages` and its cousin `on_wal_record_end`
3. Rename `pending_bytes` to `pending_metadata_bytes` since this is what it tracks now.
4. Adapting of various utility methods like `len`, `approx_pending_bytes` and `has_dirty_data_pages`.

Removal of `pending_zero_data_pages` and the optimisation associated
with it ((1) and (2)) deserves more detail.

Previously all zero data pages went through `pending_zero_data_pages`.
We wrote zero data pages when filling gaps caused by relation extension
(case A) and when handling special wal records (case B). If it happened
that the same WAL record contained a non zero write for an entry in
`pending_zero_data_pages` we skipped the zero write.

Case A: We handle this differently now. When ingesting the `SerializedValueBatch` associated with one PG WAL record, we identify the gaps and fill them in one go. Essentially, we move from a per-key process (gaps were filled after each new key) to a per-record process. Hence, the optimisation is not required anymore.

Case B: When the handling of a special record needs to zero out a key,
it just adds that to the current batch. I inspected the code, and I
don't think the optimisation kicked in here.
## Problem

Pinning a tenant by setting the Pause scheduling policy doesn't work because the drain/fill code moves the tenant around during deploys.

Closes: #9612

## Summary of changes

- In drain, only move a tenant if it is in Active or Essential mode
- In fill, only move a tenant if it is in Active mode.

The asymmetry is a bit annoying, but it faithfully respects the purposes
of the modes: Essential is meant to endeavor to keep the tenant
available, which means it needs to be drained but doesn't need to be
migrated during fills.
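A hedged sketch of the eligibility checks described above (enum and function names are illustrative, not the storage controller's actual code):

```rust
// Illustrative scheduling policy variants.
enum ShardSchedulingPolicy {
    Active,
    Essential,
    Pause,
    Stop,
}

fn eligible_for_drain(policy: &ShardSchedulingPolicy) -> bool {
    // Essential tenants must stay available, so they are still drained.
    matches!(
        policy,
        ShardSchedulingPolicy::Active | ShardSchedulingPolicy::Essential
    )
}

fn eligible_for_fill(policy: &ShardSchedulingPolicy) -> bool {
    // Fills are an optimization; only fully Active tenants are moved back.
    matches!(policy, ShardSchedulingPolicy::Active)
}
```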
I think I meant to make these changes over 6 months ago. Alas, better late than never.

1. `should_reject` doesn't eagerly intern the endpoint string.
2. The rate limiter uses a std Mutex instead of a tokio Mutex.
3. Recently I introduced a `-local-proxy` endpoint suffix. I forgot to add this to normalize (see the sketch after this list).
4. A random but small cleanup: make `ControlPlaneEvent` deserialize directly to the interned strings.
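For item 3, normalization just needs to strip known suffixes before matching; a sketch with a hypothetical helper name and an illustrative suffix list:

```rust
// Illustrative: normalize an endpoint name by stripping known suffixes so
// "foo-local-proxy" maps to the same endpoint as "foo". The suffix list here
// is an assumption for the sketch, not the proxy's actual set.
fn normalize_endpoint(endpoint: &str) -> &str {
    for suffix in ["-local-proxy", "-pooler"] {
        if let Some(stripped) = endpoint.strip_suffix(suffix) {
            return stripped;
        }
    }
    endpoint
}
```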
In #9441, the tenant has a lot of aux keys spread across multiple aux files. The perf tool shows that a significant amount of time is spent on `remove_overlapping_keys`. For sparse keyspaces, we don't need to report missing key errors anyway, and it's very likely that we will need to read all layers intersecting with the key range. Therefore, this patch adds a new fast path for sparse keyspace reads in which we do not track `unmapped_keyspace` in a fine-grained way. We only modify it when we find an image layer.

In debug mode, it took ~5min to read the aux files for a dump of the tenant, and now it's only 8s, a 60x speedup.

## Summary of changes

* Do not add sparse keys into `keys_done` so that remove_overlapping
does nothing.
* Allow `ValueReconstructSituation::Complete` to be updated again in
`ValuesReconstructState::update_key` for sparse keyspaces.

---------

Signed-off-by: Alex Chi Z <[email protected]>
…ottling_seconds

This is in line with the Prometheus guidance[0]. We also haven't started
using this metric, so renaming is essentially free.

Link: https://prometheus.io/docs/practices/naming/ [0]
Signed-off-by: Tristan Partin <[email protected]>
…layers protected by leases (#9551)

Fixes #9518.

## Problem

After removing the assertion `layers_removed == 0` in #9506, we could miss breakage if we rely solely on the successful execution of the `SELECT` query to check whether the lease is properly protecting layers. Details are listed in #9518.

Also, in integration tests, we sometimes run into the race condition where a getpage request comes before the lease gets renewed (item 2 of #8817), even if compute_ctl sends a lease renewal as soon as it sees a `/configure` API call that updates the `pageserver_connstr`. In this case, we would observe a getpage request error stating that we `tried to request a page version that was garbage collected` (as seen in the [Allure Report](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8613/11550393107/index.html#suites/3ccffb1d100105b98aed3dc19b717917/d1a1ba47bc180493)).

## Summary of changes

- Use the layer map dump to verify that the lease protects what it claims: record all historical layers that have `start_lsn <= lease_lsn` before and after running timeline GC. This is the same check as
https://github.com/neondatabase/neon/blob/ad79f4246030583e5af418cb087bf7582266accc/pageserver/src/tenant/timeline.rs#L5025-L5027
The set recorded after GC should contain every layer in the set recorded before GC.
- Wait until the log contains another successful lease request before running the `SELECT` query after GC. We argued in #8817 that the bad request can only occur within a short period after migration/restart, and our test shows that as long as a lease renewal is done before the first getpage request sent after reconfiguration, we will not get a bad request.

Signed-off-by: Yuchen Liang <[email protected]>
close #9460

## Summary of changes

A full rewrite of pagestream cmdline parsing to make it more robust and
readable.

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Arpad Müller <[email protected]>
…9669)

We need to use the shard associated with the layer file, not the shard
associated with our current tenant shard ID.

Due to shard splits, the shard IDs can refer to older files.

close #9667
@vipvap requested review from a team as code owners November 7, 2024 06:02
@vipvap requested review from MMeent, yliang412, cloneable and NanoBjorn and removed request for a team November 7, 2024 06:02

github-actions bot commented Nov 7, 2024

5337 tests run: 5115 passed, 0 failed, 222 skipped (full report)


Code coverage* (full report)

  • functions: 31.7% (7847 of 24761 functions)
  • lines: 49.4% (62084 of 125797 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
e1d0b73 at 2024-11-07T06:59:45.670Z :recycle:

@awarus self-requested a review November 7, 2024 09:04
@awarus (Contributor) left a comment


Only the following tests failed:

I suppose they're not critical.

@awarus merged commit f5eec19 into release-proxy on Nov 7, 2024
186 of 188 checks passed
@awarus deleted the rc/proxy/2024-11-07 branch November 7, 2024 10:07
@danieltprice (Contributor) commented Nov 14, 2024

Reviewed for changelog.
