Add Prometheus metrics for S3 and IAM requests #71

anurag4DSB · 2024-12-19T11:30:22Z

This PR introduces comprehensive Prometheus metrics for S3 and IAM operations, providing detailed observability into these key workflows. It also adds unit tests to validate the metrics and includes documentation to guide users on their usage and integration.

Key Changes

Commit 1. Prometheus Metrics for S3 and IAM Operations
• Added S3RequestsTotal and S3RequestDuration metrics for S3 operations.
• Added IAMRequestsTotal and IAMRequestDuration metrics for IAM operations, categorized by method and status.
Commit 2 and 3. Metrics Instrumentation
• Instrumented S3 (CreateBucket, DeleteBucket) and IAM client methods to record metrics for both successful and failed operations.
• Instrumented IAM methods similarly.
Commit 4. Unit Tests
• Added tests to validate metrics generation for S3 and IAM operations, including error scenarios.
Commit 5. Documentation
• Added a metrics overview document for open-source users.
• Included examples of gRPC, S3, and IAM metrics output and links to resources for autogenerated metrics.

Impact

•	Provides enhanced observability and debugging capabilities for COSI operations.
•	Enables users to build dashboards and alerts using Prometheus for S3 and IAM operations.
•	Establishes the foundation for exposing metrics via Helm and Kustomize in [future PRs](https://scality.atlassian.net/browse/COSI-46).

- Introduces a new metrics package to handle Prometheus instrumentation. - Adds `RequestsTotal` as a custom metric to track COSI driver requests by method and status. - Implements `StartMetricsServer` to expose metrics at the configured HTTP endpoint. - Integrates Prometheus's `promhttp.Handler` for metrics scraping. - Uses constants from the `pkg/constants` package for the metrics path. Issue: COSI-65

- In main.go, allow graceful shutdown of the metrics server. - Added `metricsAddress` flag to configure the Prometheus metrics endpoint. - Integrated `metrics.StartMetricsServer` to expose metrics at the configured address. - Ensured graceful shutdown of the metrics server during service termination. - Updated the `run` function to include metrics server lifecycle management. - Maintains flexibility for metrics configuration using constants from the `pkg/constants` package. Issue: COSI-65

go-grpc-prometheus exports various metrics: - grpc_server_started_total: Count of RPCs started on the server by method. - grpc_server_handled_total: Count of RPCs completed on the server, regardless of success or failure. - grpc_server_handling_seconds_*: Histograms or summaries (if histograms are enabled) for tracking RPC handling duration. - grpc_server_msg_received_total: Number of messages received per RPC. - grpc_server_msg_sent_total: Number of messages sent per RPC. Issue: COSI-65

Issue: COSI-65

- Introduced `S3RequestsTotal` and `S3RequestDuration` to track total number of S3 requests and their duration, categorized by method and status. - Added `IAMRequestsTotal` and `IAMRequestDuration` for IAM request metrics, categorized similarly. - Registered all metrics, including `RequestsTotal`, with Prometheus. - Enhances observability of S3 and IAM operations, allowing detailed monitoring and performance analysis. Issue: COSI-19

- Instrumented `CreateBucket` and `DeleteBucket` methods to record Prometheus metrics. - Added tracking for: - Total number of S3 requests (`S3RequestsTotal`), categorized by method and status. - Duration of S3 requests (`S3RequestDuration`), categorized similarly - Ensures metrics capture both successful and failed S3 API calls. - Improves observability and debugging for S3 bucket operations. Issue: COSI-19

codecov · 2024-12-19T11:32:12Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.62%. Comparing base (aa85ea0) to head (6701bdc).

Additional details and impacted files

Files with missing lines	Coverage Δ
pkg/clients/iam/iam_client.go	`100.00% <100.00%> (ø)`
pkg/clients/s3/s3_client.go	`100.00% <100.00%> (ø)`
pkg/metrics/metrics.go	`100.00% <100.00%> (ø)`

Components	Coverage Δ
🏠 Main Package	`∅ <ø> (∅)`
🚗 Driver Package	`92.22% <ø> (ø)`
📡 gRPC Factory Package	`83.33% <ø> (ø)`
🔐 IAM Client Package	`100.00% <100.00%> (ø)`
🌐 S3 Client Package	`100.00% <100.00%> (ø)`
🔧 Util Package	`100.00% <ø> (ø)`
📊 Metrics Package	`100.00% <100.00%> (ø)`
🔖 Constants Package	`∅ <ø> (∅)`

@@                                   Coverage Diff                                   @@
##           feature/COSI-65-add-metrics-scrapable-by-prometheus      #71      +/-   ##
=======================================================================================
+ Coverage                                                93.74%   94.62%   +0.88%     
=======================================================================================
  Files                                                       10       10              
  Lines                                                      671      782     +111     
=======================================================================================
+ Hits                                                       629      740     +111     
  Misses                                                      36       36              
  Partials                                                     6        6

- Instrumented all IAM client methods to record Prometheus metrics: - `CreateUser` - `PutUserPolicy` - `CreateAccessKey` - `DeleteUserPolicy` - `ListAccessKeys` - `DeleteAccessKey` - `DeleteUser` - Added tracking for: - Total number of IAM requests (`IAMRequestsTotal`), categorized by method and status. - Duration of IAM requests (`IAMRequestDuration`), categorized similarly. - Improved observability for individual IAM API calls, including: - Error scenarios such as non-existent entities. - Metrics capture for both successful and failed operations. - Enhanced logging to complement metrics with contextual details. - Ensures metrics collection even in edge cases like manual deletions or non-critical errors. Issue: COSI-19

Issue: COSI-19

Lists the custom metrics and provides resurces for default auto generated metrics Issue: COSI-21

jonathan-gramain · 2024-12-20T00:55:23Z

pkg/metrics/metrics.go

 	prometheus.MustRegister(RequestsTotal)
+	prometheus.MustRegister(S3RequestsTotal)
+	prometheus.MustRegister(S3RequestDuration)
+	prometheus.MustRegister(IAMRequestsTotal)
+	prometheus.MustRegister(IAMRequestDuration)


Suggested change

prometheus.MustRegister(RequestsTotal)

prometheus.MustRegister(S3RequestsTotal)

prometheus.MustRegister(S3RequestDuration)

prometheus.MustRegister(IAMRequestsTotal)

prometheus.MustRegister(IAMRequestDuration)

prometheus.MustRegister(

RequestsTotal,

S3RequestsTotal,

S3RequestDuration,

IAMRequestsTotal,

IAMRequestDuration,

)

jonathan-gramain · 2024-12-20T00:59:14Z

pkg/metrics/metrics.go

@@ -19,10 +19,44 @@ var (
 		},
 		[]string{"method", "status"},
 	)
+	S3RequestsTotal = prometheus.NewCounterVec(
+		prometheus.CounterOpts{
+			Name: "s3_requests_total",


I think @BourgoisMickael also mentioned it in another PR: there should be a prefix to avoid metrics conflicts and ensure we know where they come from (even if the instance label may give a clue it may not be enough). Something like cosi_driver_ should do.

The NewXXXVec constructors can take Namespace and Subsystem arguments in their option object which is just a convenient way of adding a prefix to metrics, such as {namespace}_{subsystem}_foo_bar.

jonathan-gramain · 2024-12-20T01:28:13Z

pkg/clients/s3/s3_client.go

@@ -77,12 +82,33 @@ func (client *S3Client) CreateBucket(ctx context.Context, bucketName string, par
 	}

 	_, err := client.S3Service.CreateBucket(ctx, input)
+
+	duration := time.Since(start).Seconds()


In case it helps, the prometheus module provides a helper to observe durations that can make the code a tiny bit simpler: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#Timer

jonathan-gramain · 2024-12-20T01:30:14Z

pkg/constants/constants.go

+
+// Prometheus metrics status values
+const (
+	StatusSuccess = "success" // Status value for successful requests


Since those are global constants, it can be better to prefix them like MetricsStatusSuccess

What about have a specific small pkg metricsStatus to remove any prefix and so everything is not mixed in a global constants file ?

jonathan-gramain · 2024-12-20T01:34:26Z

pkg/clients/s3/s3_client.go

+		status = c.StatusError
+	}
+
+	metrics.S3RequestsTotal.WithLabelValues(method, status).Inc()


Since this metrics pattern seems to be repeated a lot, it may be worth having a wrapper around all request handlers that want metrics. I think it could be done using function closures, so the function signature is identical to the wrapper in all cases. But your call, just though it could help readability and maintenance and limit the risk of mistakes.

Could it be done with some kind of middleware around the aws sdk client: https://aws.github.io/aws-sdk-go-v2/docs/middleware/ ?

jonathan-gramain · 2024-12-20T01:53:00Z

pkg/metrics/metrics_test.go

+		metrics.S3RequestsTotal.WithLabelValues("CreateBucket", "success").Inc()
+		metrics.S3RequestsTotal.WithLabelValues("CreateBucket", "error").Add(2)
+
+		resp, err := http.Get(fmt.Sprintf("http://%s%s", addr, constants.MetricsPath))


FWIW the prometheus module contains test utilities for this kind of thing, here this function could be used: https://pkg.go.dev/github.com/prometheus/[email protected]/prometheus/testutil#ScrapeAndCompare.

It's totally up to you if you'd like to use them or prefer custom test code though (and in case you need an example, some of the metrics PRs for metadata migration use some of those test functions).

jonathan-gramain · 2024-12-20T02:02:45Z

pkg/metrics/metrics_test.go

+		const numRequests = 10
+		done := make(chan bool, numRequests)
+
+		for i := 0; i < numRequests; i++ {


You could use a sync.WaitGroup to wait for the 10 goroutines a bit more simply (https://pkg.go.dev/sync#WaitGroup).

Also, it's good practice to call defer GinkgoRecover() at the beginning of each goroutine to help ginkgo recover if some goroutines are stuck or panic.

BourgoisMickael · 2024-12-20T09:34:25Z

docs/metrics-overview.md

+## gRPC Metrics
+
+The COSI driver exposes gRPC server metrics to monitor RPC activity.


Suggested change

## gRPC Metrics

The COSI driver exposes gRPC server metrics to monitor RPC activity.

## gRPC Default Metrics

The COSI driver exposes default gRPC server metrics to monitor RPC activity.

Describe that those are default metrics, not implemented by us

BourgoisMickael · 2024-12-20T09:38:41Z

docs/metrics-overview.md

+| `s3_requests_total`              | Total number of S3 requests, categorized by method and status. | `method`, `status` |
+| `s3_request_duration_seconds`    | Duration of S3 requests in seconds, categorized by method and status. | `method`, `status` |


Type (Counter, Gauge, Histogram) is missing.

Also as they have the same labels, there will be duplication between:

s3_requests_total

s3_request_duration_seconds_count

In S3C to reduce cardinality when it can be high (for like s3 and vault) I started replacing usage of the total counter with the duration histogram _count when they have the same labels.

BourgoisMickael · 2024-12-20T09:41:46Z

pkg/clients/iam/iam_client.go

@@ -76,16 +78,29 @@ var InitIAMClient = func(params util.StorageClientParameters) (*IAMClient, error

 // CreateUser creates an IAM user with the specified name.
 func (client *IAMClient) CreateUser(ctx context.Context, userName string) error {
+	method := "CreateUser"
+	start := time.Now()


Might be irrelevant, but move the start after the input ust before the method call, to avoid timing the input var declaration ?

anurag4DSB added 9 commits December 18, 2024 12:43

COSI-75: e2e-test-for-brown-field-scenario

4d96040

COSI-75: re-organize-greenfield-custom-resources

08e8c70

Metrics package unit tests

a162a1a

Issue: COSI-65

Add metrics package to Codecov configuration

aa85ea0

Issue: COSI-65

anurag4DSB added 3 commits December 19, 2024 12:32

Added unit tests for IAM and S3 metrics

8973af0

Issue: COSI-19

Add documentation for metrics overview

6701bdc

Lists the custom metrics and provides resurces for default auto generated metrics Issue: COSI-21

anurag4DSB marked this pull request as ready for review December 19, 2024 12:28

jonathan-gramain reviewed Dec 20, 2024

View reviewed changes

BourgoisMickael reviewed Dec 20, 2024

View reviewed changes

anurag4DSB force-pushed the feature/COSI-65-add-metrics-scrapable-by-prometheus branch from aa85ea0 to e124149 Compare December 20, 2024 13:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Prometheus metrics for S3 and IAM requests #71

Add Prometheus metrics for S3 and IAM requests #71

anurag4DSB commented Dec 19, 2024 •

edited

Loading

codecov bot commented Dec 19, 2024 •

edited

Loading

jonathan-gramain Dec 20, 2024

jonathan-gramain Dec 20, 2024

jonathan-gramain Dec 20, 2024

jonathan-gramain Dec 20, 2024

BourgoisMickael Dec 20, 2024

jonathan-gramain Dec 20, 2024

BourgoisMickael Dec 20, 2024

jonathan-gramain Dec 20, 2024

jonathan-gramain Dec 20, 2024

BourgoisMickael Dec 20, 2024

BourgoisMickael Dec 20, 2024

BourgoisMickael Dec 20, 2024 •

edited

Loading

		## gRPC Metrics

		The COSI driver exposes gRPC server metrics to monitor RPC activity.

		\| `s3_requests_total` \| Total number of S3 requests, categorized by method and status. \| `method`, `status` \|
		\| `s3_request_duration_seconds` \| Duration of S3 requests in seconds, categorized by method and status. \| `method`, `status` \|

Add Prometheus metrics for S3 and IAM requests #71

Are you sure you want to change the base?

Add Prometheus metrics for S3 and IAM requests #71

Conversation

anurag4DSB commented Dec 19, 2024 • edited Loading

Key Changes

Impact

codecov bot commented Dec 19, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BourgoisMickael Dec 20, 2024 • edited Loading

Choose a reason for hiding this comment

anurag4DSB commented Dec 19, 2024 •

edited

Loading

codecov bot commented Dec 19, 2024 •

edited

Loading

BourgoisMickael Dec 20, 2024 •

edited

Loading