-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Prometheus metrics for S3 and IAM requests #71
base: feature/COSI-65-add-metrics-scrapable-by-prometheus
Are you sure you want to change the base?
Add Prometheus metrics for S3 and IAM requests #71
Conversation
- Introduces a new metrics package to handle Prometheus instrumentation. - Adds `RequestsTotal` as a custom metric to track COSI driver requests by method and status. - Implements `StartMetricsServer` to expose metrics at the configured HTTP endpoint. - Integrates Prometheus's `promhttp.Handler` for metrics scraping. - Uses constants from the `pkg/constants` package for the metrics path. Issue: COSI-65
- In main.go, allow graceful shutdown of the metrics server. - Added `metricsAddress` flag to configure the Prometheus metrics endpoint. - Integrated `metrics.StartMetricsServer` to expose metrics at the configured address. - Ensured graceful shutdown of the metrics server during service termination. - Updated the `run` function to include metrics server lifecycle management. - Maintains flexibility for metrics configuration using constants from the `pkg/constants` package. Issue: COSI-65
go-grpc-prometheus exports various metrics: - grpc_server_started_total: Count of RPCs started on the server by method. - grpc_server_handled_total: Count of RPCs completed on the server, regardless of success or failure. - grpc_server_handling_seconds_*: Histograms or summaries (if histograms are enabled) for tracking RPC handling duration. - grpc_server_msg_received_total: Number of messages received per RPC. - grpc_server_msg_sent_total: Number of messages sent per RPC. Issue: COSI-65
Issue: COSI-65
Issue: COSI-65
- Introduced `S3RequestsTotal` and `S3RequestDuration` to track total number of S3 requests and their duration, categorized by method and status. - Added `IAMRequestsTotal` and `IAMRequestDuration` for IAM request metrics, categorized similarly. - Registered all metrics, including `RequestsTotal`, with Prometheus. - Enhances observability of S3 and IAM operations, allowing detailed monitoring and performance analysis. Issue: COSI-19
- Instrumented `CreateBucket` and `DeleteBucket` methods to record Prometheus metrics. - Added tracking for: - Total number of S3 requests (`S3RequestsTotal`), categorized by method and status. - Duration of S3 requests (`S3RequestDuration`), categorized similarly - Ensures metrics capture both successful and failed S3 API calls. - Improves observability and debugging for S3 bucket operations. Issue: COSI-19
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## feature/COSI-65-add-metrics-scrapable-by-prometheus #71 +/- ##
=======================================================================================
+ Coverage 93.74% 94.62% +0.88%
=======================================================================================
Files 10 10
Lines 671 782 +111
=======================================================================================
+ Hits 629 740 +111
Misses 36 36
Partials 6 6 |
- Instrumented all IAM client methods to record Prometheus metrics: - `CreateUser` - `PutUserPolicy` - `CreateAccessKey` - `DeleteUserPolicy` - `ListAccessKeys` - `DeleteAccessKey` - `DeleteUser` - Added tracking for: - Total number of IAM requests (`IAMRequestsTotal`), categorized by method and status. - Duration of IAM requests (`IAMRequestDuration`), categorized similarly. - Improved observability for individual IAM API calls, including: - Error scenarios such as non-existent entities. - Metrics capture for both successful and failed operations. - Enhanced logging to complement metrics with contextual details. - Ensures metrics collection even in edge cases like manual deletions or non-critical errors. Issue: COSI-19
Issue: COSI-19
Lists the custom metrics and provides resurces for default auto generated metrics Issue: COSI-21
pkg/metrics/metrics.go
Outdated
prometheus.MustRegister(RequestsTotal) | ||
prometheus.MustRegister(S3RequestsTotal) | ||
prometheus.MustRegister(S3RequestDuration) | ||
prometheus.MustRegister(IAMRequestsTotal) | ||
prometheus.MustRegister(IAMRequestDuration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prometheus.MustRegister(RequestsTotal) | |
prometheus.MustRegister(S3RequestsTotal) | |
prometheus.MustRegister(S3RequestDuration) | |
prometheus.MustRegister(IAMRequestsTotal) | |
prometheus.MustRegister(IAMRequestDuration) | |
prometheus.MustRegister( | |
RequestsTotal, | |
S3RequestsTotal, | |
S3RequestDuration, | |
IAMRequestsTotal, | |
IAMRequestDuration, | |
) |
@@ -19,10 +19,44 @@ var ( | |||
}, | |||
[]string{"method", "status"}, | |||
) | |||
S3RequestsTotal = prometheus.NewCounterVec( | |||
prometheus.CounterOpts{ | |||
Name: "s3_requests_total", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @BourgoisMickael also mentioned it in another PR: there should be a prefix to avoid metrics conflicts and ensure we know where they come from (even if the instance
label may give a clue it may not be enough). Something like cosi_driver_
should do.
The NewXXXVec
constructors can take Namespace
and Subsystem
arguments in their option object which is just a convenient way of adding a prefix to metrics, such as {namespace}_{subsystem}_foo_bar
.
@@ -77,12 +82,33 @@ func (client *S3Client) CreateBucket(ctx context.Context, bucketName string, par | |||
} | |||
|
|||
_, err := client.S3Service.CreateBucket(ctx, input) | |||
|
|||
duration := time.Since(start).Seconds() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case it helps, the prometheus module provides a helper to observe durations that can make the code a tiny bit simpler: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#Timer
|
||
// Prometheus metrics status values | ||
const ( | ||
StatusSuccess = "success" // Status value for successful requests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since those are global constants, it can be better to prefix them like MetricsStatusSuccess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about have a specific small pkg metricsStatus
to remove any prefix and so everything is not mixed in a global constants file ?
status = c.StatusError | ||
} | ||
|
||
metrics.S3RequestsTotal.WithLabelValues(method, status).Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this metrics pattern seems to be repeated a lot, it may be worth having a wrapper around all request handlers that want metrics. I think it could be done using function closures, so the function signature is identical to the wrapper in all cases. But your call, just though it could help readability and maintenance and limit the risk of mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be done with some kind of middleware around the aws sdk client: https://aws.github.io/aws-sdk-go-v2/docs/middleware/ ?
metrics.S3RequestsTotal.WithLabelValues("CreateBucket", "success").Inc() | ||
metrics.S3RequestsTotal.WithLabelValues("CreateBucket", "error").Add(2) | ||
|
||
resp, err := http.Get(fmt.Sprintf("http://%s%s", addr, constants.MetricsPath)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW the prometheus module contains test utilities for this kind of thing, here this function could be used: https://pkg.go.dev/github.com/prometheus/[email protected]/prometheus/testutil#ScrapeAndCompare.
It's totally up to you if you'd like to use them or prefer custom test code though (and in case you need an example, some of the metrics PRs for metadata migration use some of those test functions).
const numRequests = 10 | ||
done := make(chan bool, numRequests) | ||
|
||
for i := 0; i < numRequests; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use a sync.WaitGroup
to wait for the 10 goroutines a bit more simply (https://pkg.go.dev/sync#WaitGroup).
Also, it's good practice to call defer GinkgoRecover()
at the beginning of each goroutine to help ginkgo recover if some goroutines are stuck or panic.
## gRPC Metrics | ||
|
||
The COSI driver exposes gRPC server metrics to monitor RPC activity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
## gRPC Metrics | |
The COSI driver exposes gRPC server metrics to monitor RPC activity. | |
## gRPC Default Metrics | |
The COSI driver exposes default gRPC server metrics to monitor RPC activity. |
Describe that those are default metrics, not implemented by us
| `s3_requests_total` | Total number of S3 requests, categorized by method and status. | `method`, `status` | | ||
| `s3_request_duration_seconds` | Duration of S3 requests in seconds, categorized by method and status. | `method`, `status` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type (Counter
, Gauge
, Histogram
) is missing.
Also as they have the same labels, there will be duplication between:
s3_requests_total
s3_request_duration_seconds_count
In S3C to reduce cardinality when it can be high (for like s3 and vault) I started replacing usage of the total counter with the duration histogram _count
when they have the same labels.
@@ -76,16 +78,29 @@ var InitIAMClient = func(params util.StorageClientParameters) (*IAMClient, error | |||
|
|||
// CreateUser creates an IAM user with the specified name. | |||
func (client *IAMClient) CreateUser(ctx context.Context, userName string) error { | |||
method := "CreateUser" | |||
start := time.Now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be irrelevant, but move the start after the input ust before the method call, to avoid timing the input var declaration ?
aa85ea0
to
e124149
Compare
This PR introduces comprehensive Prometheus metrics for S3 and IAM operations, providing detailed observability into these key workflows. It also adds unit tests to validate the metrics and includes documentation to guide users on their usage and integration.
Key Changes
• Added S3RequestsTotal and S3RequestDuration metrics for S3 operations.
• Added IAMRequestsTotal and IAMRequestDuration metrics for IAM operations, categorized by method and status.
• Instrumented S3 (CreateBucket, DeleteBucket) and IAM client methods to record metrics for both successful and failed operations.
• Instrumented IAM methods similarly.
• Added tests to validate metrics generation for S3 and IAM operations, including error scenarios.
• Added a metrics overview document for open-source users.
• Included examples of gRPC, S3, and IAM metrics output and links to resources for autogenerated metrics.
Impact