Add per task report upload metrics (0.6 backport) #2513

Merged
inahga merged 4 commits into release/0.6 from inahga/metrics-0.6 on Feb 5, 2024

Conversation

@inahga (Contributor) commented Jan 18, 2024

Supports #2293

Backport of #2508 and #2537.

This PR should not be merged until #2553 is merged and in a release. In other words, these PRs should not be in the same release.

@inahga (Contributor Author) commented Jan 18, 2024

Using this for load testing, since 0.7 is WIP. In reality I think this will need to be two PRs over two releases, to account for DB migration.

@inahga inahga force-pushed the inahga/metrics-0.6 branch 2 times, most recently from f3bcc7f to 86e70e2 on January 22, 2024 18:38
@inahga (Contributor Author) commented Jan 22, 2024

Load test results:

Each trial is the last 15 minutes of our standard load test run at 100QPS. The task is Time Interval Prio3Count. This should exercise the slowest path: validation that requires DB access, plus a successful report.

System details:
  • CPU: AMD Ryzen 9 7950X 16-Core Processor (16C/32T)
  • Memory: 64GB DDR5
  • Disk: Samsung SSD 980 PRO 1TB NVMe
  • rustc 1.75.0-alpine
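
As a reference point for how the offered load is generated, here is a minimal sketch of a fixed-rate upload loop at 100QPS; `upload_report()` is a hypothetical placeholder for the real DAP client call, not Janus's actual API.

```rust
use std::time::Duration;

// Hypothetical stand-in for building and submitting one Prio3Count report.
async fn upload_report() {
    // ...client logic elided...
}

#[tokio::main]
async fn main() {
    // Fire one upload every 10 ms, i.e. roughly 100 requests per second.
    let mut ticker = tokio::time::interval(Duration::from_millis(10));
    loop {
        ticker.tick().await;
        // Spawn each upload so a slow response doesn't lower the offered rate.
        tokio::spawn(upload_report());
    }
}
```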
100QPS Baseline (Janus release 0.6.9)

Trial 1: [graph omitted]

Trial 2: [graph omitted]

n.b. there is no transaction error graph since this transaction can't fail due to serialization errors.

100QPS 1 shard

Trial 1: [graph omitted]

Trial 2: [graph omitted]

100QPS 32 shards

Trial 1: [graph omitted]

Trial 2: [graph omitted]

Overall the results look to be non-regressive, as long as the shard count is greater than 1 to avoid serialization errors.

Note that these graphs record the total memory usage of the PostgreSQL container, but I think that turned out to be meaningless, since it includes memory used for buffers. Either way, I don't observe any strange behavior with respect to memory.

Note that some graphs contain a period of high latency: this is noise from the Kubernetes cluster HPA not yet having scaled the aggregator deployment to a happy state.
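
For context on the shard counts above: each upload increments one of N counter rows chosen at random, so concurrent transactions rarely contend on the same row, and serialization errors become unlikely once N > 1. A rough sketch of that pattern, assuming illustrative table and column names (`task_upload_counters`, `ord`, `report_count`) rather than the exact schema in this PR:

```rust
use rand::Rng;
use tokio_postgres::Client;

/// Increment the per-task report counter by bumping a randomly chosen shard.
/// Assumes a unique index on (task_id, ord).
async fn increment_report_count(
    client: &Client,
    task_id: &[u8],
    shard_count: u64,
) -> Result<(), tokio_postgres::Error> {
    // With shard_count > 1, two concurrent uploads for the same task usually
    // pick different rows, so SERIALIZABLE transactions rarely conflict.
    let ord = rand::thread_rng().gen_range(0..shard_count) as i64;
    client
        .execute(
            "INSERT INTO task_upload_counters (task_id, ord, report_count)
             VALUES ($1, $2, 1)
             ON CONFLICT (task_id, ord) DO UPDATE
             SET report_count = task_upload_counters.report_count + 1",
            &[&task_id, &ord],
        )
        .await?;
    Ok(())
}
```

Reading the metric back is then just a SUM of `report_count` over the task's shard rows.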

@inahga inahga marked this pull request as ready for review January 22, 2024 21:40
@inahga inahga requested a review from a team as a code owner January 22, 2024 21:40
@inahga inahga marked this pull request as draft January 22, 2024 21:40
@inahga (Contributor Author) commented Jan 22, 2024

I'll need to break this PR into two to facilitate a zero downtime PostgreSQL migration.

@inahga (Contributor Author) commented Jan 26, 2024

> I'll need to break this PR into two to facilitate a zero downtime PostgreSQL migration.

Actually, I don't think this is true.

Our rollout cadence is like so:

  1. Roll out new version of Janus.
  2. Execute DB migrations.

Note that we don't wait for step 1 to succeed before executing step 2 during deployment. The new pods will remain in a crash loop until the database has been updated, and in the meantime traffic won't be shifted to the new deployment.

@divergentdave (Collaborator) commented:

I think it would be preferable to do this in two deploys as follows:

  1. Update the supported schema versions to (2, 1), and add the new table
  2. Update the supported schema versions to (2), and make the rest of the code changes depending on the task_upload_counters table

If we did this in a single deploy, note that the old ReplicaSet wouldn't be able to start any new pods until the schema migration was applied. If we encountered issues during the deploy, this would complicate the response.
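
A rough sketch of the version gate behind that two-deploy plan, using an illustrative constant name (`SUPPORTED_SCHEMA_VERSIONS`) and check rather than Janus's exact code:

```rust
// Deploy 1: the binary tolerates both the old and new schema, so the
// migration that adds task_upload_counters can run at any point.
const SUPPORTED_SCHEMA_VERSIONS: &[i64] = &[2, 1];
// Deploy 2 (once every replica runs against schema version 2): narrow the
// list to &[2] and ship the code that reads/writes task_upload_counters.

fn check_schema_version(current: i64) -> Result<(), String> {
    if SUPPORTED_SCHEMA_VERSIONS.contains(&current) {
        Ok(())
    } else {
        Err(format!(
            "database schema version {current} is not supported by this binary"
        ))
    }
}
```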

@inahga inahga force-pushed the inahga/metrics-0.6 branch from ec1bf58 to aa76c41 on January 26, 2024 18:10
@inahga (Contributor Author) commented Jan 26, 2024

> If we did this in a single deploy, note that the old ReplicaSet wouldn't be able to start any new pods until the schema migration was applied. If we encountered issues during the deploy, this would complicate the response.

Fair enough. I was relying on the "old replica set won't start new pods" behavior to make this work, but that is indeed chaotic.

@inahga inahga force-pushed the inahga/metrics-0.6 branch from aa76c41 to 6480c91 on January 26, 2024 18:15
@inahga inahga changed the base branch from release/0.6 to inahga/metrics-0.6-schema January 26, 2024 18:15
@inahga inahga force-pushed the inahga/metrics-0.6 branch from 6480c91 to f910486 on January 26, 2024 18:17
@inahga inahga marked this pull request as ready for review January 26, 2024 18:20
Base automatically changed from inahga/metrics-0.6-schema to release/0.6 January 26, 2024 18:51
inahga and others added 4 commits February 5, 2024 11:31
* Add per task report upload metrics.

* Change default to 32, add documentation on option

* Fix test

* Build query instead of brute forcing each possible one

* Don't wait on bad reports

* Use Runtime and RuntimeManager instead of sleeping in tests

* Clippy

* Cargo doc

* Don't use macro needlessly

Co-authored-by: Brandon Pitman <[email protected]>

---------

Co-authored-by: Brandon Pitman <[email protected]>
Don't change existing schema
@inahga inahga force-pushed the inahga/metrics-0.6 branch from f910486 to 401e53d on February 5, 2024 16:32
@inahga inahga enabled auto-merge (squash) February 5, 2024 16:32
@inahga inahga merged commit 9cf74fe into release/0.6 Feb 5, 2024
8 checks passed
@inahga inahga deleted the inahga/metrics-0.6 branch February 5, 2024 17:08