Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Prometheus metrics for S3 and IAM requests #71

Open
wants to merge 12 commits into
base: feature/COSI-65-add-metrics-scrapable-by-prometheus
Choose a base branch
from

Conversation

anurag4DSB
Copy link
Collaborator

@anurag4DSB anurag4DSB commented Dec 19, 2024

This PR introduces comprehensive Prometheus metrics for S3 and IAM operations, providing detailed observability into these key workflows. It also adds unit tests to validate the metrics and includes documentation to guide users on their usage and integration.

Key Changes

  • Commit 1. Prometheus Metrics for S3 and IAM Operations
    • Added S3RequestsTotal and S3RequestDuration metrics for S3 operations.
    • Added IAMRequestsTotal and IAMRequestDuration metrics for IAM operations, categorized by method and status.
  • Commit 2 and 3. Metrics Instrumentation
    • Instrumented S3 (CreateBucket, DeleteBucket) and IAM client methods to record metrics for both successful and failed operations.
    • Instrumented IAM methods similarly.
  • Commit 4. Unit Tests
    • Added tests to validate metrics generation for S3 and IAM operations, including error scenarios.
  • Commit 5. Documentation
    • Added a metrics overview document for open-source users.
    • Included examples of gRPC, S3, and IAM metrics output and links to resources for autogenerated metrics.

Impact

•	Provides enhanced observability and debugging capabilities for COSI operations.
•	Enables users to build dashboards and alerts using Prometheus for S3 and IAM operations.
•	Establishes the foundation for exposing metrics via Helm and Kustomize in [future PRs](https://scality.atlassian.net/browse/COSI-46).

- Introduces a new metrics package to handle Prometheus instrumentation.
- Adds `RequestsTotal` as a custom metric to track COSI driver requests
  by method and status.
- Implements `StartMetricsServer` to expose metrics at the configured
  HTTP endpoint.
- Integrates Prometheus's `promhttp.Handler` for metrics scraping.
- Uses constants from the `pkg/constants` package for the metrics path.

Issue: COSI-65
- In main.go, allow graceful shutdown of the metrics server.
- Added `metricsAddress` flag to configure the Prometheus metrics endpoint.
- Integrated `metrics.StartMetricsServer` to expose metrics at the
  configured address.
- Ensured graceful shutdown of the metrics server during service
  termination.
- Updated the `run` function to include metrics server lifecycle
  management.
- Maintains flexibility for metrics configuration using constants from
  the `pkg/constants` package.

Issue: COSI-65
go-grpc-prometheus exports various metrics:
- grpc_server_started_total: Count of RPCs started on the server by
  method.
- grpc_server_handled_total: Count of RPCs completed on the server,
  regardless of success or failure.
- grpc_server_handling_seconds_*: Histograms or summaries (if histograms
  are enabled) for tracking RPC handling duration.
- grpc_server_msg_received_total: Number of messages received per RPC.
- grpc_server_msg_sent_total: Number of messages sent per RPC.

Issue: COSI-65
- Introduced `S3RequestsTotal` and `S3RequestDuration` to track total
  number of S3 requests and their duration, categorized by method and
  status.
- Added `IAMRequestsTotal` and `IAMRequestDuration` for IAM request
  metrics, categorized similarly.
- Registered all metrics, including `RequestsTotal`, with Prometheus.
- Enhances observability of S3 and IAM operations, allowing detailed
  monitoring and performance analysis.

Issue: COSI-19
- Instrumented `CreateBucket` and `DeleteBucket` methods to record
  Prometheus metrics.
- Added tracking for:
  - Total number of S3 requests (`S3RequestsTotal`), categorized by
    method and status.
  - Duration of S3 requests (`S3RequestDuration`), categorized similarly
- Ensures metrics capture both successful and failed S3 API calls.
- Improves observability and debugging for S3 bucket operations.

Issue: COSI-19
Copy link

codecov bot commented Dec 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.62%. Comparing base (aa85ea0) to head (6701bdc).

Additional details and impacted files

Impacted file tree graph

Files with missing lines Coverage Δ
pkg/clients/iam/iam_client.go 100.00% <100.00%> (ø)
pkg/clients/s3/s3_client.go 100.00% <100.00%> (ø)
pkg/metrics/metrics.go 100.00% <100.00%> (ø)
Components Coverage Δ
🏠 Main Package ∅ <ø> (∅)
🚗 Driver Package 92.22% <ø> (ø)
📡 gRPC Factory Package 83.33% <ø> (ø)
🔐 IAM Client Package 100.00% <100.00%> (ø)
🌐 S3 Client Package 100.00% <100.00%> (ø)
🔧 Util Package 100.00% <ø> (ø)
📊 Metrics Package 100.00% <100.00%> (ø)
🔖 Constants Package ∅ <ø> (∅)
@@                                   Coverage Diff                                   @@
##           feature/COSI-65-add-metrics-scrapable-by-prometheus      #71      +/-   ##
=======================================================================================
+ Coverage                                                93.74%   94.62%   +0.88%     
=======================================================================================
  Files                                                       10       10              
  Lines                                                      671      782     +111     
=======================================================================================
+ Hits                                                       629      740     +111     
  Misses                                                      36       36              
  Partials                                                     6        6              

- Instrumented all IAM client methods to record Prometheus metrics:
  - `CreateUser`
  - `PutUserPolicy`
  - `CreateAccessKey`
  - `DeleteUserPolicy`
  - `ListAccessKeys`
  - `DeleteAccessKey`
  - `DeleteUser`
- Added tracking for:
  - Total number of IAM requests (`IAMRequestsTotal`), categorized by
    method and status.
  - Duration of IAM requests (`IAMRequestDuration`), categorized
    similarly.
- Improved observability for individual IAM API calls, including:
  - Error scenarios such as non-existent entities.
  - Metrics capture for both successful and failed operations.
- Enhanced logging to complement metrics with contextual details.
- Ensures metrics collection even in edge cases like manual deletions
  or non-critical errors.

Issue: COSI-19
Lists the custom metrics and provides resurces for default auto
generated metrics

Issue: COSI-21
@anurag4DSB anurag4DSB marked this pull request as ready for review December 19, 2024 12:28
Comment on lines 55 to 59
prometheus.MustRegister(RequestsTotal)
prometheus.MustRegister(S3RequestsTotal)
prometheus.MustRegister(S3RequestDuration)
prometheus.MustRegister(IAMRequestsTotal)
prometheus.MustRegister(IAMRequestDuration)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
prometheus.MustRegister(RequestsTotal)
prometheus.MustRegister(S3RequestsTotal)
prometheus.MustRegister(S3RequestDuration)
prometheus.MustRegister(IAMRequestsTotal)
prometheus.MustRegister(IAMRequestDuration)
prometheus.MustRegister(
RequestsTotal,
S3RequestsTotal,
S3RequestDuration,
IAMRequestsTotal,
IAMRequestDuration,
)

@@ -19,10 +19,44 @@ var (
},
[]string{"method", "status"},
)
S3RequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "s3_requests_total",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @BourgoisMickael also mentioned it in another PR: there should be a prefix to avoid metrics conflicts and ensure we know where they come from (even if the instance label may give a clue it may not be enough). Something like cosi_driver_ should do.

The NewXXXVec constructors can take Namespace and Subsystem arguments in their option object which is just a convenient way of adding a prefix to metrics, such as {namespace}_{subsystem}_foo_bar.

@@ -77,12 +82,33 @@ func (client *S3Client) CreateBucket(ctx context.Context, bucketName string, par
}

_, err := client.S3Service.CreateBucket(ctx, input)

duration := time.Since(start).Seconds()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case it helps, the prometheus module provides a helper to observe durations that can make the code a tiny bit simpler: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#Timer


// Prometheus metrics status values
const (
StatusSuccess = "success" // Status value for successful requests

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since those are global constants, it can be better to prefix them like MetricsStatusSuccess

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about have a specific small pkg metricsStatus to remove any prefix and so everything is not mixed in a global constants file ?

status = c.StatusError
}

metrics.S3RequestsTotal.WithLabelValues(method, status).Inc()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this metrics pattern seems to be repeated a lot, it may be worth having a wrapper around all request handlers that want metrics. I think it could be done using function closures, so the function signature is identical to the wrapper in all cases. But your call, just though it could help readability and maintenance and limit the risk of mistakes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be done with some kind of middleware around the aws sdk client: https://aws.github.io/aws-sdk-go-v2/docs/middleware/ ?

metrics.S3RequestsTotal.WithLabelValues("CreateBucket", "success").Inc()
metrics.S3RequestsTotal.WithLabelValues("CreateBucket", "error").Add(2)

resp, err := http.Get(fmt.Sprintf("http://%s%s", addr, constants.MetricsPath))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW the prometheus module contains test utilities for this kind of thing, here this function could be used: https://pkg.go.dev/github.com/prometheus/[email protected]/prometheus/testutil#ScrapeAndCompare.

It's totally up to you if you'd like to use them or prefer custom test code though (and in case you need an example, some of the metrics PRs for metadata migration use some of those test functions).

const numRequests = 10
done := make(chan bool, numRequests)

for i := 0; i < numRequests; i++ {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use a sync.WaitGroup to wait for the 10 goroutines a bit more simply (https://pkg.go.dev/sync#WaitGroup).

Also, it's good practice to call defer GinkgoRecover() at the beginning of each goroutine to help ginkgo recover if some goroutines are stuck or panic.

Comment on lines +11 to +13
## gRPC Metrics

The COSI driver exposes gRPC server metrics to monitor RPC activity.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## gRPC Metrics
The COSI driver exposes gRPC server metrics to monitor RPC activity.
## gRPC Default Metrics
The COSI driver exposes default gRPC server metrics to monitor RPC activity.

Describe that those are default metrics, not implemented by us

Comment on lines +36 to +37
| `s3_requests_total` | Total number of S3 requests, categorized by method and status. | `method`, `status` |
| `s3_request_duration_seconds` | Duration of S3 requests in seconds, categorized by method and status. | `method`, `status` |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type (Counter, Gauge, Histogram) is missing.

Also as they have the same labels, there will be duplication between:

  • s3_requests_total
  • s3_request_duration_seconds_count

In S3C to reduce cardinality when it can be high (for like s3 and vault) I started replacing usage of the total counter with the duration histogram _count when they have the same labels.

@@ -76,16 +78,29 @@ var InitIAMClient = func(params util.StorageClientParameters) (*IAMClient, error

// CreateUser creates an IAM user with the specified name.
func (client *IAMClient) CreateUser(ctx context.Context, userName string) error {
method := "CreateUser"
start := time.Now()
Copy link

@BourgoisMickael BourgoisMickael Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be irrelevant, but move the start after the input ust before the method call, to avoid timing the input var declaration ?

@anurag4DSB anurag4DSB force-pushed the feature/COSI-65-add-metrics-scrapable-by-prometheus branch from aa85ea0 to e124149 Compare December 20, 2024 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants