-
Notifications
You must be signed in to change notification settings - Fork 198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revamp metrics to account the status code #320
Conversation
} | ||
return nil, api.NewInvalidArgError(err.Error()) | ||
} | ||
|
||
if s.ratelimiter != nil { | ||
err := s.checkRateLimitsAndAddRates(ctx, blob, origin, authenticatedAddress) | ||
if err != nil { | ||
for _, param := range securityParams { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was wrong accounting: the checkRateLimitsAndAddRates
will error out and return upon the first error (e.g. first quorum), but this accounting increments metrics for each quorum
disperser/apiserver/server.go
Outdated
@@ -169,6 +170,7 @@ func (s *DispersalServer) DisperseBlobAuthenticated(stream pb.Disperser_Disperse | |||
} | |||
|
|||
func (s *DispersalServer) DisperseBlob(ctx context.Context, req *pb.DisperseBlobRequest) (*pb.DisperseBlobReply, error) { | |||
s.metrics.IncrementBlobRequestNum("DisperseBlob") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if it would be more useful to emit this at the end instead of the beginning. Because right now IncrementBlobRequestNum has status code = total. Total should equal "success + failures" anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a time series for "total" will make the monitoring query easy, just 1 - 500 / total
, and also more stable (no need to update when there is a new failure case to add here).
We currently have 4 failure codes; adding more will make the query itself more complex if "total" has to be expanded
operators/churner/metrics.go
Outdated
// IncrementRequestNum increments the number of successful requests | ||
func (g *Metrics) IncrementRequestNum(method string) { | ||
g.NumRequests.With(prometheus.Labels{ | ||
"status": "total", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here, feel like we shouldn't track total because we can compute it from failed + success.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we use the consistent grpc error codes in metrics as well instead of tracking the mapped http codes?
@@ -99,17 +100,18 @@ func (g *Metrics) HandleSuccessfulRequest(quorum string, blobBytes int, method s | |||
} | |||
|
|||
// IncrementFailedBlobRequestNum increments the number of failed blob requests | |||
func (g *Metrics) IncrementFailedBlobRequestNum(quorum string, method string) { | |||
func (g *Metrics) IncrementFailedBlobRequestNum(statusCode string, quorum string, method string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have these http status codes typed somewhere and use that type instead of taking raw string. Something like
type HttpStatusCode string
const (
SuccessHttpStatusCode = "200"
...
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HTTP code is already quite standard (probably it's the standard)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is. My point isn't whether it's the standard or not. It's that we shouldn't take raw string in this method.
Why don't we use existing consts defined in standard library? https://go.dev/src/net/http/status.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same reason why quorum isn't a uint8 which you don't even need a library.
Either way can work, http is better understood in general. |
Thinking it again, it makes more sense to me to use gPRC code: these codes are not just used by us, but also users who may communicate back with errors; having them monitored with same codes will reduce the extra mental translation. The HTTP codes could be just used a documentation/annotation for understanding. Going to change it so. |
Ping, this pr needs to make progress |
Why are these changes needed?
Use the standard API status code to label the metrics, so we can correctly account for availability.
Checks