Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run loop additional metrics #2888

Merged
merged 14 commits into from
Aug 16, 2024
Merged

Run loop additional metrics #2888

merged 14 commits into from
Aug 16, 2024

Conversation

squadgazzz
Copy link
Contributor

@squadgazzz squadgazzz commented Aug 13, 2024

Description

In order to move towards one price per token per block we need to remove inefficiencies in our run loop to make sure it can run within 2-4s (to leave sufficient time for the solvers).

Changes

Achieves the following:

  1. How long after the block's timestamp does the run loop start: single_run_delay metric.
  2. How long does it take until solve is called on the solvers: auction_preprocessing_time metric.
  3. How long does solver take: This can be done using the existing solve metrics(by calculating the MAX value among solvers for the last N seconds).
  4. How long does reveal take: Seems like the current reveal metric wasn't used, so I updated its type to Histogram.
  5. How long does post-processing (writing the competition to the DB) take: auction_postprocessing_time metric.
  6. How long does settle take: The existing histogram metric was updated by incrementing the counter by the elapsed time and used in the following queries only.
Dashboard: Alerts (Prod), Panel: Failing Settlements
Query: sum by (network) (rate(gp_v2_autopilot_runloop_settle{result="success"}[10m])) / sum by (network) (rate(gp_v2_autopilot_runloop_settle{}[10m]))

Dashboard: Alerts (Prod), Panel: Failing Settlements
Query: ((sum by (network) (rate(gp_v2_autopilot_runloop_settle{result="success"}[10m])) / sum by (network) (rate(gp_v2_autopilot_runloop_settle{}[10m])))) and (sum by (network) (rate(gp_v2_autopilot_runloop_settle{}[10m])*600) > 5)

Now it records individual elapsed times and the existing queries should continue to work after appending _sum postfix to the metric name.

Also, I added single_run_time and auction_update_time metrics for better visibility to avoid accumulating all the values from different sources. So the total round trip could be calculated as single_run_time + auction_update_time.

I used separate Histogram metrics since that might be not really suitable to have all of them on a single panel due to different values(from milliseconds to 5-10 seconds or so).

How to test

The plan was to deploy a temp image on staging.

Related Issues

Fixes #2859

@squadgazzz squadgazzz requested a review from a team as a code owner August 13, 2024 14:02
Copy link
Contributor

@fleupold fleupold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some important metrics are missing imo:

  • Time before competition is started
  • Roundtrip time for reveal
  • Time for all the persistence work we do before calling settle (maybe extract into onw method to keep single_run smaller)
  • Roundtrip time for settle

Once all those metrics are in place, can you add them to the GPv2 Grafana dashboard or create a new dashboard specifically for the autopilot runloop performance?

crates/autopilot/src/run_loop.rs Outdated Show resolved Hide resolved
crates/autopilot/src/run_loop.rs Outdated Show resolved Hide resolved
@fleupold
Copy link
Contributor

Also, I think it would be useful to measure the time compared to the current block's timestamp (so that we can also measure the delay with which our run loop starts).

So the stats we should have is:

  1. How long after the block's timestamp does the run loop start
  2. How long does it take until solve is called on the solvers
  3. How long does solver take
  4. How long does reveal take
  5. How long does post-processing (writing the competition to the DB) take
  6. How long does settle take

# Conflicts:
#	crates/autopilot/src/solvable_orders.rs
@squadgazzz squadgazzz marked this pull request as draft August 14, 2024 11:03
@squadgazzz
Copy link
Contributor Author

Also, I think it would be useful to measure the time compared to the current block's timestamp (so that we can also measure the delay with which our run loop starts).

Updated everything including the PR description.

@squadgazzz squadgazzz marked this pull request as ready for review August 14, 2024 13:52
Copy link
Contributor

@fleupold fleupold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be a separate PR, but I'd still be able in measuring our delay with which we even start a single run compared to the timestamp on the block.

crates/autopilot/src/run_loop.rs Outdated Show resolved Hide resolved
crates/autopilot/src/run_loop.rs Outdated Show resolved Hide resolved
@@ -316,6 +322,7 @@ impl RunLoop {
Metrics::fee_policies_store_error();
tracing::warn!(?err, "failed to save fee policies");
}
Metrics::competition_stored(start.elapsed());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can move everything between the first start and this into a prost_processing method to keep single_run a bit more readable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the actual code, not the metrics, right? I will open a separate PR since it would be easier to read the actual changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crates/autopilot/src/solvable_orders.rs Show resolved Hide resolved
@squadgazzz
Copy link
Contributor Author

squadgazzz commented Aug 15, 2024

Can be a separate PR, but I'd still be able in measuring our delay with which we even start a single run compared to the timestamp on the block.

Added this also. single_run_delay

Copy link
Contributor

@MartinquaXD MartinquaXD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run loop instrumentation looks alright. But more metrics on the details of building the auction would be nice.

crates/autopilot/src/run_loop.rs Outdated Show resolved Hide resolved
crates/autopilot/src/run_loop.rs Show resolved Hide resolved
@@ -304,6 +308,9 @@ impl SolvableOrdersCache {
};

tracing::debug!(%block, "updated current auction cache");
self.metrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GIven that we know that this takes a significant amount of time we could already add metrics for the individual stages of the auction building.
I assume most of the time will likely be spent on the DB query but there will probably also be outliers in the individual steps that need to be ironed out.

Copy link
Contributor Author

@squadgazzz squadgazzz Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a HistogramVec for individual update stages except solvable order fetching since we already have a separate DB metric for this.

&self,
auction_id: domain::auction::Id,
auction: &domain::Auction,
init_block_timestamp: u64,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: If this variable is only passed in to trigger the metric, we could probably also simply do it where we call single_run

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fetch the latest block timestamp or what exactly? That would work as long as the auction update function takes less than 1 round. For arbitrum, this is not the case.

crates/autopilot/src/run_loop.rs Outdated Show resolved Hide resolved
@squadgazzz squadgazzz enabled auto-merge (squash) August 16, 2024 06:43
@squadgazzz squadgazzz merged commit c3bbfb8 into main Aug 16, 2024
10 checks passed
@squadgazzz squadgazzz deleted the 2859/run-loop-metrics branch August 16, 2024 06:49
@github-actions github-actions bot locked and limited conversation to collaborators Aug 16, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

chore: Instrument autopilot runloop
3 participants