Run loop additional metrics #2888

squadgazzz · 2024-08-13T14:02:51Z

Description

In order to move towards one price per token per block we need to remove inefficiencies in our run loop to make sure it can run within 2-4s (to leave sufficient time for the solvers).

Changes

Achieves the following:

How long after the block's timestamp does the run loop start: single_run_delay metric.
How long does it take until solve is called on the solvers: auction_preprocessing_time metric.
How long does solver take: This can be derived from the existing solve metrics (by calculating the MAX value among solvers for the last N seconds).
How long does reveal take: Seems like the current reveal metric wasn't used, so I updated its type to Histogram.
How long does post-processing (writing the competition to the DB) take: auction_postprocessing_time metric.
How long does settle take: The existing histogram metric was previously updated by incrementing the counter by the elapsed time, and it is used only in the following queries.

Dashboard: Alerts (Prod), Panel: Failing Settlements
Query: sum by (network) (rate(gp_v2_autopilot_runloop_settle{result="success"}[10m])) / sum by (network) (rate(gp_v2_autopilot_runloop_settle{}[10m]))

Dashboard: Alerts (Prod), Panel: Failing Settlements
Query: ((sum by (network) (rate(gp_v2_autopilot_runloop_settle{result="success"}[10m])) / sum by (network) (rate(gp_v2_autopilot_runloop_settle{}[10m])))) and (sum by (network) (rate(gp_v2_autopilot_runloop_settle{}[10m])*600) > 5)

Now it records individual elapsed times, and the existing queries should continue to work after appending the _sum suffix to the metric name.
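
To illustrate why appending _sum should keep the old queries meaningful, here is a minimal standard-library sketch. The `Histogram` type is a hypothetical stand-in and not the `prometheus` crate's API; the bucket bounds and the unit (seconds) are assumptions as well. It shows that a histogram's running sum grows exactly like a counter that is incremented by each elapsed time.

```rust
/// Hypothetical stand-in for a Prometheus-style histogram (not the `prometheus` crate).
struct Histogram {
    /// Upper bucket bounds in seconds, sorted ascending (assumed values).
    bounds: Vec<f64>,
    /// Cumulative counts: counts[i] is the number of observations <= bounds[i].
    counts: Vec<u64>,
    /// Sum of all observed values (what `<name>_sum` would expose).
    sum: f64,
    /// Number of observations (what `<name>_count` would expose).
    count: u64,
}

impl Histogram {
    fn new(bounds: Vec<f64>) -> Self {
        let counts = vec![0; bounds.len()];
        Self { bounds, counts, sum: 0.0, count: 0 }
    }

    /// Records one elapsed time in every bucket whose bound covers it.
    fn observe(&mut self, value: f64) {
        for (bound, bucket) in self.bounds.iter().zip(self.counts.iter_mut()) {
            if value <= *bound {
                *bucket += 1;
            }
        }
        self.sum += value;
        self.count += 1;
    }
}

fn main() {
    let mut settle = Histogram::new(vec![0.5, 1.0, 5.0, 10.0]);
    // The old behaviour: a counter incremented by each elapsed time.
    let mut old_counter = 0.0;

    for elapsed in [0.25, 0.75, 4.0] {
        settle.observe(elapsed);
        old_counter += elapsed;
    }

    // The histogram's sum tracks the same quantity as the old counter.
    assert_eq!(settle.sum, old_counter);
    println!("sum = {}, count = {}, buckets = {:?}", settle.sum, settle.count, settle.counts);
}
```

The extra count and bucket series are what make per-settlement durations visible, which the counter alone could not provide.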

I also added single_run_time and auction_update_time metrics for better visibility, so that values from different sources don't have to be accumulated. The total round trip can then be calculated as single_run_time + auction_update_time.

I used separate Histogram metrics since it might not be suitable to have all of them on a single panel due to their different value ranges (from milliseconds to 5-10 seconds or so).

How to test

The plan was to deploy a temp image on staging.

Related Issues

Fixes #2859

fleupold

Some important metrics are missing imo:

Time before competition is started
Roundtrip time for reveal
Time for all the persistence work we do before calling settle (maybe extract into its own method to keep single_run smaller)
Roundtrip time for settle

Once all those metrics are in place, can you add them to the GPv2 Grafana dashboard or create a new dashboard specifically for the autopilot runloop performance?

crates/autopilot/src/run_loop.rs

fleupold · 2024-08-14T07:41:47Z

Also, I think it would be useful to measure the time compared to the current block's timestamp (so that we can also measure the delay with which our run loop starts).

So the stats we should have are:

How long after the block's timestamp does the run loop start
How long does it take until solve is called on the solvers
How long does solver take
How long does reveal take
How long does post-processing (writing the competition to the DB) take
How long does settle take
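
A minimal sketch of the first stat, assuming the block timestamp is given in whole seconds since the Unix epoch. The function name `delay_since_block` and the sample values are hypothetical and only serve the illustration; this is not the code from the PR.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Time elapsed between a block's timestamp (seconds since the Unix epoch)
/// and `now`. Returns zero if `now` lies before the block timestamp,
/// which can happen with clock skew.
fn delay_since_block(block_timestamp: u64, now: SystemTime) -> Duration {
    let block_time = UNIX_EPOCH + Duration::from_secs(block_timestamp);
    now.duration_since(block_time).unwrap_or_default()
}

fn main() {
    // Made-up block timestamp; a real caller would take it from the block header.
    let block_timestamp = 1_723_700_000;
    let delay = delay_since_block(block_timestamp, SystemTime::now());
    println!("run loop would start {delay:?} after the block");
}
```

Since block timestamps have one-second resolution, a delay computed this way cannot be more precise than that.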

squadgazzz · 2024-08-14T13:51:59Z

Also, I think it would be useful to measure the time compared to the current block's timestamp (so that we can also measure the delay with which our run loop starts).

Updated everything including the PR description.

fleupold

Can be a separate PR, but I'd still be interested in measuring the delay with which we even start a single run compared to the timestamp on the block.

crates/autopilot/src/run_loop.rs

fleupold · 2024-08-14T18:21:34Z

crates/autopilot/src/run_loop.rs

@@ -316,6 +322,7 @@ impl RunLoop {
                Metrics::fee_policies_store_error();
                tracing::warn!(?err, "failed to save fee policies");
            }
+            Metrics::competition_stored(start.elapsed());


Maybe we can move everything between the first start and this into a post_processing method to keep single_run a bit more readable?

You mean the actual code, not the metrics, right? I will open a separate PR since it would be easier to read the actual changes.

crates/autopilot/src/solvable_orders.rs

squadgazzz · 2024-08-15T06:15:45Z

Can be a separate PR, but I'd still be interested in measuring the delay with which we even start a single run compared to the timestamp on the block.

Added this as well: single_run_delay.

MartinquaXD

Run loop instrumentation looks alright. But more metrics on the details of building the auction would be nice.

crates/autopilot/src/run_loop.rs

MartinquaXD · 2024-08-15T06:50:53Z

crates/autopilot/src/solvable_orders.rs

@@ -304,6 +308,9 @@ impl SolvableOrdersCache {
        };

        tracing::debug!(%block, "updated current auction cache");
+        self.metrics


Given that we know this takes a significant amount of time, we could already add metrics for the individual stages of the auction building.
I assume most of the time will be spent on the DB query, but there will probably also be outliers in the individual steps that need to be ironed out.

Added a HistogramVec for individual update stages except solvable order fetching since we already have a separate DB metric for this.
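
A rough sketch of how per-stage timing could look, using only the standard library. `StageMetrics`, `StageTimer`, and the stage labels are hypothetical names for illustration; the actual change uses a HistogramVec, which this sketch replaces with a plain map from stage label to observed durations.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical per-stage store: stage label -> observed durations.
#[derive(Default)]
struct StageMetrics {
    observations: Mutex<HashMap<&'static str, Vec<Duration>>>,
}

impl StageMetrics {
    /// Starts a timer for `stage`; the elapsed time is recorded when the
    /// returned guard is dropped.
    fn stage_timer(&self, stage: &'static str) -> StageTimer<'_> {
        StageTimer { metrics: self, stage, start: Instant::now() }
    }

    /// Number of observations recorded so far for `stage`.
    fn count(&self, stage: &str) -> usize {
        let observations = self.observations.lock().unwrap();
        observations.get(stage).map_or(0, |durations| durations.len())
    }
}

/// Guard that records the elapsed time of one stage on drop.
struct StageTimer<'a> {
    metrics: &'a StageMetrics,
    stage: &'static str,
    start: Instant,
}

impl Drop for StageTimer<'_> {
    fn drop(&mut self) {
        let elapsed = self.start.elapsed();
        let mut observations = self.metrics.observations.lock().unwrap();
        observations.entry(self.stage).or_default().push(elapsed);
    }
}

fn main() {
    let metrics = StageMetrics::default();
    {
        // Made-up stage label; the guard records when this scope ends.
        let _timer = metrics.stage_timer("filter_orders");
        // ... the work of this stage would run here ...
    }
    {
        let _timer = metrics.stage_timer("fetch_prices");
    }
    println!("filter_orders observations: {}", metrics.count("filter_orders"));
}
```

Recording on drop means early returns inside a stage are still measured, at the cost of the recording point being implicit.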

crates/autopilot/src/solvable_orders.rs

fleupold · 2024-08-15T11:51:13Z

crates/autopilot/src/run_loop.rs

+        &self,
+        auction_id: domain::auction::Id,
+        auction: &domain::Auction,
+        init_block_timestamp: u64,


nit: If this variable is only passed in to trigger the metric, we could probably also simply do it where we call single_run

To fetch the latest block timestamp, or what exactly? That would only work as long as the auction update function takes less than 1 round. For Arbitrum, this is not the case.
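
The concern can be made concrete with a small model. All numbers below are made up for illustration (a 250 ms block interval and a 900 ms auction update), times are in milliseconds, and `latest_block_time` is a hypothetical helper. If the delay is measured against whichever block is latest when single_run is called, blocks that arrived during the auction update hide most of the real delay.

```rust
/// Timestamp of the most recent block at `now_ms`, assuming blocks arrive at a
/// fixed, non-zero interval starting from `init_block_ms` (a simplification).
fn latest_block_time(init_block_ms: u64, block_interval_ms: u64, now_ms: u64) -> u64 {
    let elapsed = now_ms.saturating_sub(init_block_ms);
    init_block_ms + (elapsed / block_interval_ms) * block_interval_ms
}

fn main() {
    let init_block_ms = 10_000; // block that triggered the auction update
    let block_interval_ms = 250; // assumed short block time
    let run_start_ms = init_block_ms + 900; // auction update took 900 ms

    // Measured against the block that started the update.
    let delay_from_init = run_start_ms - init_block_ms;

    // Measured against whatever block is latest when single_run is called.
    let latest = latest_block_time(init_block_ms, block_interval_ms, run_start_ms);
    let delay_from_latest = run_start_ms - latest;

    println!("from init block: {delay_from_init} ms, from latest block: {delay_from_latest} ms");
}
```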

crates/autopilot/src/run_loop.rs

Run loop additional metrics

97b4005

squadgazzz requested a review from a team as a code owner August 13, 2024 14:02

fleupold reviewed Aug 13, 2024

Merge branch 'main' into 2859/run-loop-metrics

2d72a2f

# Conflicts: # crates/autopilot/src/solvable_orders.rs

squadgazzz marked this pull request as draft August 14, 2024 11:03

squadgazzz added 3 commits August 14, 2024 16:08

Updated metrics

26e1527

Single run refactored

796e822

auction_preprocessing_time

290c52f

squadgazzz marked this pull request as ready for review August 14, 2024 13:52

fleupold reviewed Aug 14, 2024

Naming

2b7fcae

squadgazzz mentioned this pull request Aug 15, 2024

Runloop post processing function #2895

Merged

MartinquaXD reviewed Aug 15, 2024

squadgazzz added 4 commits August 15, 2024 10:58

Auction update individual stages time

b80eff3

Bucket

3e0b450

Non-optional init block timestamp

1953367

Missing metric label

0a8d18e

MartinquaXD reviewed Aug 15, 2024

fleupold approved these changes Aug 15, 2024

squadgazzz added 3 commits August 15, 2024 15:16

Stage timer function

f05901d

Buckets

827499b

Merge branch 'main' into 2859/run-loop-metrics

26df6e2

MartinquaXD approved these changes Aug 15, 2024

Merge branch 'main' into 2859/run-loop-metrics

3243d9a

squadgazzz enabled auto-merge (squash) August 16, 2024 06:43

squadgazzz merged commit c3bbfb8 into main Aug 16, 2024
10 checks passed

squadgazzz deleted the 2859/run-loop-metrics branch August 16, 2024 06:49

github-actions bot locked and limited conversation to collaborators Aug 16, 2024
