Add per task report upload metrics (0.6 backport) #2513

Merged
inahga merged 4 commits into release/0.6 from inahga/metrics-0.6 on Feb 5, 2024

Conversation

@inahga (Contributor) commented Jan 18, 2024

Supports #2293

Backport of #2508 and #2537.

This PR should not be merged until #2553 is merged and in a release. In other words, these PRs should not be in the same release.

@inahga (Contributor Author) commented Jan 18, 2024

Using this for load testing, since 0.7 is WIP. In reality I think this will need to be two PRs over two releases, to account for DB migration.

@inahga inahga force-pushed the inahga/metrics-0.6 branch 2 times, most recently from f3bcc7f to 86e70e2 on January 22, 2024 18:38
@inahga (Contributor Author) commented Jan 22, 2024

Load test results:

Each trial is the last 15 minutes of our standard load test run at 100QPS. The task is Time Interval Prio3Count. This should exercise the slowest path: validation that requires DB access, plus a successful report.

System details:
  • CPU: AMD Ryzen 9 7950X 16-Core Processor (16C/32T)
  • Memory: 64GB DDR5
  • Disk: Samsung SSD 980 PRO 1TB NVMe
  • rustc 1.75.0-alpine
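
As a reference point for how the offered load is generated, here is a minimal sketch of a fixed-rate upload loop at 100QPS; `upload_report()` is a hypothetical placeholder for the real DAP client call, not Janus's actual API.

```rust
use std::time::Duration;

// Hypothetical stand-in for building and submitting one Prio3Count report.
async fn upload_report() {
    // ...client logic elided...
}

#[tokio::main]
async fn main() {
    // Fire one upload every 10 ms, i.e. roughly 100 requests per second.
    let mut ticker = tokio::time::interval(Duration::from_millis(10));
    loop {
        ticker.tick().await;
        // Spawn each upload so a slow response doesn't lower the offered rate.
        tokio::spawn(upload_report());
    }
}
```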
100QPS Baseline (Janus release 0.6.9)

Trial 1: [graph omitted]

Trial 2: [graph omitted]

n.b. there is no transaction error graph since this transaction can't fail due to serialization errors.

100QPS 1 shard

Trial 1: [graph omitted]

Trial 2: [graph omitted]

100QPS 32 shards

Trial 1: [graph omitted]

Trial 2: [graph omitted]

Overall the results look to be non-regressive, as long as the shard count is greater than 1 to avoid serialization errors.

Note that these graphs record the total memory usage of the PostgreSQL container, but I think that turned out to be meaningless, since it includes memory used for buffers. Either way, I don't observe any strange behavior with respect to memory.

Note that some graphs contain a period of high latency: this is noise from the Kubernetes cluster HPA not yet having scaled the aggregator deployment to a happy state.
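
For context on the shard counts above: each upload increments one of N counter rows chosen at random, so concurrent transactions rarely contend on the same row, and serialization errors become unlikely once N > 1. A rough sketch of that pattern, assuming illustrative table and column names (`task_upload_counters`, `ord`, `report_count`) rather than the exact schema in this PR:

```rust
use rand::Rng;
use tokio_postgres::Client;

/// Increment the per-task report counter by bumping a randomly chosen shard.
/// Assumes a unique index on (task_id, ord).
async fn increment_report_count(
    client: &Client,
    task_id: &[u8],
    shard_count: u64,
) -> Result<(), tokio_postgres::Error> {
    // With shard_count > 1, two concurrent uploads for the same task usually
    // pick different rows, so SERIALIZABLE transactions rarely conflict.
    let ord = rand::thread_rng().gen_range(0..shard_count) as i64;
    client
        .execute(
            "INSERT INTO task_upload_counters (task_id, ord, report_count)
             VALUES ($1, $2, 1)
             ON CONFLICT (task_id, ord) DO UPDATE
             SET report_count = task_upload_counters.report_count + 1",
            &[&task_id, &ord],
        )
        .await?;
    Ok(())
}
```

Reading the metric back is then just a SUM of `report_count` over the task's shard rows.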

@inahga inahga marked this pull request as ready for review January 22, 2024 21:40
@inahga inahga requested a review from a team as a code owner January 22, 2024 21:40
@inahga inahga marked this pull request as draft January 22, 2024 21:40
@inahga (Contributor Author) commented Jan 22, 2024

I'll need to break this PR into two to facilitate a zero downtime PostgreSQL migration.

@inahga (Contributor Author) commented Jan 26, 2024

> I'll need to break this PR into two to facilitate a zero downtime PostgreSQL migration.

Actually, I don't think this is true.

Our rollout cadence is like so:

  1. Roll out new version of Janus.
  2. Execute DB migrations.

Note that we don't wait for step 1 to succeed before executing step 2 during deployment. The new pods will remain in a crash loop until the database has been updated, and in the meantime traffic won't be shifted to the new deployment.

@divergentdave (Collaborator) commented:

I think it would be preferable to do this in two deploys as follows:

  1. Update the supported schema versions to (2, 1), and add the new table
  2. Update the supported schema versions to (2), and make the rest of the code changes depending on the task_upload_counters table

If we did this in a single deploy, note that the old ReplicaSet wouldn't be able to start any new pods until the schema migration was applied. If we encountered issues during the deploy, this would complicate the response.
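
A rough sketch of the version gate behind that two-deploy plan, using an illustrative constant name (`SUPPORTED_SCHEMA_VERSIONS`) and check rather than Janus's exact code:

```rust
// Deploy 1: the binary tolerates both the old and new schema, so the
// migration that adds task_upload_counters can run at any point.
const SUPPORTED_SCHEMA_VERSIONS: &[i64] = &[2, 1];
// Deploy 2 (once every replica runs against schema version 2): narrow the
// list to &[2] and ship the code that reads/writes task_upload_counters.

fn check_schema_version(current: i64) -> Result<(), String> {
    if SUPPORTED_SCHEMA_VERSIONS.contains(&current) {
        Ok(())
    } else {
        Err(format!(
            "database schema version {current} is not supported by this binary"
        ))
    }
}
```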

@inahga inahga force-pushed the inahga/metrics-0.6 branch from ec1bf58 to aa76c41 on January 26, 2024 18:10
@inahga (Contributor Author) commented Jan 26, 2024

> If we did this in a single deploy, note that the old ReplicaSet wouldn't be able to start any new pods until the schema migration was applied. If we encountered issues during the deploy, this would complicate the response.

Fair enough. I was relying on the "old replica set won't start new pods" behavior to make this work, but that is indeed chaotic.

@inahga inahga force-pushed the inahga/metrics-0.6 branch from aa76c41 to 6480c91 on January 26, 2024 18:15
@inahga inahga changed the base branch from release/0.6 to inahga/metrics-0.6-schema January 26, 2024 18:15
@inahga inahga force-pushed the inahga/metrics-0.6 branch from 6480c91 to f910486 on January 26, 2024 18:17
@inahga inahga marked this pull request as ready for review January 26, 2024 18:20
Base automatically changed from inahga/metrics-0.6-schema to release/0.6 January 26, 2024 18:51
inahga and others added 4 commits February 5, 2024 11:31
* Add per task report upload metrics.

* Change default to 32, add documentation on option

* Fix test

* Build query instead of brute forcing each possible one

* Don't wait on bad reports

* Use Runtime and RuntimeManager instead of sleeping in tests

* Clippy

* Cargo doc

* Don't use macro needlessly

Co-authored-by: Brandon Pitman <[email protected]>

---------

Co-authored-by: Brandon Pitman <[email protected]>
Don't change existing schema
@inahga inahga force-pushed the inahga/metrics-0.6 branch from f910486 to 401e53d on February 5, 2024 16:32
@inahga inahga enabled auto-merge (squash) February 5, 2024 16:32
@inahga inahga merged commit 9cf74fe into release/0.6 Feb 5, 2024
8 checks passed
@inahga inahga deleted the inahga/metrics-0.6 branch February 5, 2024 17:08