Metastore client fix + admin test page #3852
Draft
+226
−19
Background
We have observed metastore client latency spikes during rollouts / leadership changes.
The client relies on error responses that contain the address of the new leader. However, there is a mismatch between what the client expects and what metastore nodes return.
Problem
The metastore client discovers servers using the gRPC port (9095) and maintains a map keyed by addresses such as
pyroscope-metastore-0.pyroscope-metastore-headless.profiles-dev-003:9095
Metastore nodes return the leader using the Raft port (9099) and the svc.cluster.local domain:
pyroscope-metastore-0.pyroscope-metastore-headless.profiles-dev-003.svc.cluster.local.:9099
This causes a failed check here:
pyroscope/pkg/experiment/metastore/client/methods.go
Lines 83 to 85 in f4cec5d
As a result, the leader address is never found in the client's map, and we resort to random selection until we hit the right node.
Fix
The main options are to “hack” the client and massage the data, or to change the Raft server identity. This PR does the former. It is messy, however, and ideally we should align the two components. I will give the second option a try as well.
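As a rough illustration of the client-side option, the sketch below rewrites a leader address in the Raft form into the form used as a key in the client's server map. It is not the code in this PR: the function name `normalizeLeaderAddr`, the `clusterSuffix` constant, and passing the gRPC port as a parameter are all assumptions made for the example.

```go
package main

import (
	"net"
	"strings"
)

// clusterSuffix is the domain suffix seen in the leader address returned by
// metastore nodes (assumption: it is always this exact suffix).
const clusterSuffix = ".svc.cluster.local"

// normalizeLeaderAddr is a hypothetical helper. It takes a leader address in
// the Raft form (host with cluster domain and trailing dot, Raft port) and
// returns the form the client keeps in its map (short host, gRPC port).
func normalizeLeaderAddr(raftAddr, grpcPort string) (string, error) {
	host, _, err := net.SplitHostPort(raftAddr)
	if err != nil {
		return "", err
	}
	host = strings.TrimSuffix(host, ".")           // drop the trailing root dot
	host = strings.TrimSuffix(host, clusterSuffix) // drop the cluster domain
	return net.JoinHostPort(host, grpcPort), nil
}
```

With the addresses from the Problem section, calling `normalizeLeaderAddr` with the `...svc.cluster.local.:9099` address and `"9095"` yields `pyroscope-metastore-0.pyroscope-metastore-headless.profiles-dev-003:9095`, which matches the key the client already holds. Aligning the Raft server identity instead would make this kind of rewriting unnecessary.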
With either option, the latency becomes stable. However, we can't go lower than 51ms because of the fixed backoff here:
pyroscope/pkg/experiment/metastore/client/methods.go
Line 24 in f4cec5d
To get below that, we can consider switching to a more aggressive, possibly exponential, backoff.
Bonus
I've added a metastore client test page to make testing easier.