COSI-65: Add Prometheus Metrics Support/Instrumentation to COSI Driver #69

Closed

Collaborator

@anurag4DSB anurag4DSB commented Dec 19, 2024

This PR introduces comprehensive Prometheus metrics support for the COSI driver, including metrics instrumentation, integration, and unit tests. The changes are grouped into four key commits, each addressing a distinct aspect of the implementation. Reviewers are encouraged to follow the commit story for a structured understanding of the changes.
Documentation coming soon

Commit Summary:

  1. Add metrics package for Prometheus instrumentation
    • Introduces the metrics package for Prometheus instrumentation.
    • Adds a custom RequestsTotal metric to track COSI driver requests by method and status.
    • Implements StartMetricsServer to expose metrics at an HTTP endpoint.
  2. Integrate Prometheus metrics server in COSI driver
    • Adds a metricsAddress flag for configuring the metrics endpoint.
    • Manages the metrics server lifecycle with graceful shutdown support.
    • Integrates metrics.StartMetricsServer into the driver’s main runtime.
  3. Instrumented gRPC server with Prometheus exporter
    • Adds gRPC metrics such as RPC counts, handling duration, and message totals using go-grpc-prometheus.
  4. Metrics package unit tests
    • Adds unit tests for the metrics package, ensuring complete coverage of StartMetricsServer and StartMetricsServerWithListener.
  5. Update Codecov config

Request for Reviewers:

Please follow the commit story to understand the changes in detail. Focus areas include:
• Metrics integration with the gRPC server.
• Implementation of StartMetricsServer and its lifecycle management.
• Completeness and accuracy of unit tests. (comments are very intentional for Contri-X as I am a bit new to prom unit testing, helping my future self.)

Issue:
Resolves: COSI-65

Example

# curl http://localhost:8080/metrics
# HELP go_cgo_go_to_c_calls_calls_total Count of calls made from Go to C by the current process.
# TYPE go_cgo_go_to_c_calls_calls_total counter
go_cgo_go_to_c_calls_calls_total 0
# HELP go_cpu_classes_gc_mark_assist_cpu_seconds_total Estimated total CPU time goroutines spent performing GC tasks to assist the GC and prevent it from falling behind the application. This metric is an overestimate, and not directly comparable to system CPU time measurements. Compare only with other /cpu/classes metrics.
# TYPE go_cpu_classes_gc_mark_assist_cpu_seconds_total counter
go_cpu_classes_gc_mark_assist_cpu_seconds_total 0.001221251
.
.
.
grpc_server_started_total{grpc_method="DriverCreateBucket",grpc_service="cosi.v1alpha1.Provisioner",grpc_type="unary"} 0
grpc_server_started_total{grpc_method="DriverDeleteBucket",grpc_service="cosi.v1alpha1.Provisioner",grpc_type="unary"} 0
grpc_server_started_total{grpc_method="DriverGetInfo",grpc_service="cosi.v1alpha1.Identity",grpc_type="unary"} 6
grpc_server_started_total{grpc_method="DriverGrantBucketAccess",grpc_service="cosi.v1alpha1.Provisioner",grpc_type="unary"} 0
grpc_server_started_total{grpc_method="DriverRevokeBucketAccess",grpc_service="cosi.v1alpha1.Provisioner",grpc_type="unary"} 0
.
.
.
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 37
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

codecov bot commented Dec 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.74%. Comparing base (a9cf51d) to head (e124149).

Additional details and impacted files

Files with missing lines      Coverage Δ
pkg/grpcfactory/server.go     87.80% <100.00%> (+4.47%) ⬆️
pkg/metrics/metrics.go        100.00% <100.00%> (ø)

Components                    Coverage Δ
🏠 Main Package               ∅ <ø> (∅)
🚗 Driver Package             92.22% <ø> (ø)
📡 gRPC Factory Package       83.33% <100.00%> (+1.68%) ⬆️
🔐 IAM Client Package         100.00% <ø> (ø)
🌐 S3 Client Package          100.00% <ø> (ø)
🔧 Util Package               100.00% <ø> (ø)
📊 Metrics Package            100.00% <100.00%> (∅)
🔖 Constants Package          ∅ <ø> (∅)

@@            Coverage Diff             @@
##             main      #69      +/-   ##
==========================================
+ Coverage   93.40%   93.74%   +0.33%     
==========================================
  Files           9       10       +1     
  Lines         637      671      +34     
==========================================
+ Hits          595      629      +34     
  Misses         36       36              
  Partials        6        6              

@anurag4DSB anurag4DSB force-pushed the feature/COSI-65-add-metrics-scrapable-by-prometheus branch 2 times, most recently from b0108f0 to 191d005 Compare December 19, 2024 08:22
@anurag4DSB anurag4DSB marked this pull request as ready for review December 19, 2024 08:27
@anurag4DSB anurag4DSB changed the title Add metrics package for Prometheus instrumentation Add Prometheus Metrics Support/Instrumentation to COSI Driver Dec 19, 2024
@anurag4DSB anurag4DSB changed the title Add Prometheus Metrics Support/Instrumentation to COSI Driver COSI-65: Add Prometheus Metrics Support/Instrumentation to COSI Driver Dec 19, 2024
@anurag4DSB anurag4DSB force-pushed the feature/COSI-65-add-metrics-scrapable-by-prometheus branch from 7965f66 to aa85ea0 Compare December 19, 2024 08:37
@@ -50,5 +50,6 @@ func main() {
// Call the run function (defined in cmd.go)
if err := run(ctx); err != nil {
klog.ErrorS(err, "Scality COSI driver encountered an error, shutting down")
os.Exit(1)
Collaborator Author

Graceful exit to shutdown the metrics server as well.

@@ -10,3 +10,9 @@ const (
LvlDebug // 4 - Debug-level logs, tricky logic areas
LvlTrace // 5 - Trace-level logs, detailed troubleshooting context
)

// Service initialization constants
Collaborator Author

This will be expanded soon with all service initialization constants, to make sure we don't have magic numbers/strings


go func() {
klog.InfoS("Starting Prometheus metrics server", "address", listener.Addr().String())
if err := srv.Serve(listener); err != nil && err != http.ErrServerClosed {

For my own edification, why is ErrServerClosed not considered an error?

Collaborator Author

@anurag4DSB anurag4DSB Dec 19, 2024

The http.ErrServerClosed error is not considered an error in this context because it is a normal part of the lifecycle of an HTTP server in Go.

Basically, http.ErrServerClosed is returned by the server's Serve and ListenAndServe methods once the server has been shut down with Shutdown or Close. This indicates an intentional shutdown, which is expected behavior, not a failure. Treating http.ErrServerClosed as a non-error ensures that the shutdown process doesn’t log misleading or unnecessary error messages.

Yep makes total sense, thanks!

@@ -50,5 +50,6 @@ func main() {
// Call the run function (defined in cmd.go)

I can't write a comment above this line, but at main.go:42: doesn't ctx.Done() close immediately when cancel is called? I don't think checking that is worth it. The timeout is good though. We could however add another read on the sigs channel in the select if we wanted to force shutdown on multiple SIGINT.

Collaborator Author

You are right, the select block is indeed redundant.
When cancel() is called, ctx.Done() closes immediately, so the select statement in the goroutine will always take the <-ctx.Done() case first.

Collaborator Author

I am thinking of changing it to something like this.
What do you think?

go func() {
    sig := <-sigs
    klog.InfoS("Signal received", "type", sig)
    cancel()

    klog.InfoS("Scality COSI driver shutdown initiated successfully, context canceled")

    select {
    case sig = <-sigs:
        klog.ErrorS(nil, "Force shutdown due to repeated signal", "type", sig)
        os.Exit(1)
    case <-time.After(shutdownTimeout):
        klog.ErrorS(nil, "Force shutdown due to timeout", "timeout", shutdownTimeout)
        os.Exit(1)
    }
}()

I think this is good 👍

Perhaps the first log could say something like "Initiating graceful shutdown, repeat signal to force shutdown"?

Collaborator Author

Okay, I will create another PR for this; it's out of scope for this one. Thanks.

Name: "cosi_requests_total",
Help: "Total number of requests handled by the COSI driver.",
},
[]string{"method", "status"},
@BourgoisMickael BourgoisMickael Dec 19, 2024

Actually, for HTTP request labels in other components we use:

  • method: HTTP method
  • code: response code (instead of status)
  • action: S3 action (optional)

Collaborator Author

This is a gRPC service and not HTTP, hence the difference.

Ok I see.

In the doc from your other PR I see the possible values; maybe you can add a comment here describing the possible values for those labels.

And should it be called *_grpc_requests_total then, to be clear?

Collaborator Author

All gRPC metrics are generated automatically; this is a placeholder for custom metrics.
I can remove it if that's better, but I wanted to keep it for future use.
What I had in mind was to track total requests for both gRPC and HTTP calls, but that needs to be discussed with @davidmercier-scality, so I left it for now.

It's not being used to generate any custom metrics as of now, even in further PRs.

Collaborator Author

But indeed we can use the cosi_driver prefix for this.

Comment on lines +16 to +17
MetricsPath = "/metrics"
MetricsAddress = ":8080"

Should this go directly in the metrics.go file? It's not going to be used anywhere else?

Should the address use only the local interface 127.0.0.1 by default? Also, the port is already used by cloudserver; can we pick another one that's not used by other components?

Collaborator Author

Yes, you are right, it will go in metrics.go; the Go way is indeed to keep it close to where it is used if it is not reused elsewhere.

To address your concerns:
1. Service Context: This service is not part of RING and will be deployed on customers’ Kubernetes clusters. Since it is deployed externally, it won’t conflict with any port numbers within RING.
2. Deployment Setup:
• The container is deployed as a Kubernetes pod, and the metrics route is exposed via a Kubernetes service with its own unique cluster IP.
• End-users can modify the exposed port via deployment configurations or Helm charts.
3. Port Conflict Analysis:
• Even without modifying the port, each Kubernetes service gets a unique IP.
• Since only one HTTP server (for metrics) runs within the pod, conflicts are highly unlikely within the pod.

Here’s an example for clarity:

✗ kubectl get svc --all-namespaces  
NAMESPACE                NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE  
default                  kubernetes                    ClusterIP   10.96.0.1       <none>        443/TCP                  17h  
kube-system              kube-dns                      ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP   17h  
kubernetes-dashboard     dashboard-metrics-scraper     ClusterIP   10.103.11.103   <none>        8000/TCP                 17h  
kubernetes-dashboard     kubernetes-dashboard          ClusterIP   10.96.45.42     <none>        80/TCP                   17h  
scality-object-storage   scality-cosi-driver-metrics   ClusterIP   10.97.169.22    <none>        8080/TCP                 17h  

go.mod Outdated
@@ -5,8 +5,10 @@ go 1.22.6
require (
github.com/aws/aws-sdk-go-v2/credentials v1.17.47
github.com/aws/smithy-go v1.22.1
github.com/grpc-ecosystem/go-grpc-prometheus v1.2.0

The GitHub page says it's deprecated in favor of go-grpc-middleware.

Maybe you can check if you can prefix those metrics with s3_cosi_ so we can easily identify them

Collaborator Author

This is not an S3 service so we will prefix it with cosi_driver for custom metrics but for default grpc metrics we should keep the convention of grpc_, just like we do for HTTP protocol.

Metrics such as grpc_server_started_total convey that these are standard gRPC server metrics, helping tools or dashboards that expect gRPC naming conventions to process them without requiring additional configuration.

Collaborator Author

I will check out go-grpc-middleware.

Ok I see.

Maybe a config option like metrics_prefix if client wants to add some custom prefix for cosi. Otherwise it can still be differentiated with the job name and it might not matter at all.

Collaborator Author

For custom metrics we can add that.
So the default can be the cosi_driver prefix, which is configurable.
But I would like to keep the gRPC metrics in the standard format.

@anurag4DSB anurag4DSB force-pushed the bugfix/COSI-74-remove-silent-errors branch from 89e0dfc to a3e7337 Compare December 20, 2024 13:19
Base automatically changed from bugfix/COSI-74-remove-silent-errors to main December 20, 2024 13:30
- Introduces a new metrics package to handle Prometheus instrumentation.
- Adds `RequestsTotal` as a custom metric to track COSI driver requests
  by method and status.
- Implements `StartMetricsServer` to expose metrics at the configured
  HTTP endpoint.
- Integrates Prometheus's `promhttp.Handler` for metrics scraping.
- Uses constants from the `pkg/constants` package for the metrics path.

Issue: COSI-65
- In main.go, allow graceful shutdown of the metrics server.
- Added `metricsAddress` flag to configure the Prometheus metrics endpoint.
- Integrated `metrics.StartMetricsServer` to expose metrics at the
  configured address.
- Ensured graceful shutdown of the metrics server during service
  termination.
- Updated the `run` function to include metrics server lifecycle
  management.
- Maintains flexibility for metrics configuration using constants from
  the `pkg/constants` package.

Issue: COSI-65
go-grpc-prometheus exports various metrics:
- grpc_server_started_total: Count of RPCs started on the server by
  method.
- grpc_server_handled_total: Count of RPCs completed on the server,
  regardless of success or failure.
- grpc_server_handling_seconds_*: Histogram tracking RPC handling
  duration (exported only if histograms are enabled).
- grpc_server_msg_received_total: Number of messages received per RPC.
- grpc_server_msg_sent_total: Number of messages sent per RPC.

Issue: COSI-65
@anurag4DSB anurag4DSB force-pushed the feature/COSI-65-add-metrics-scrapable-by-prometheus branch from aa85ea0 to e124149 Compare December 20, 2024 13:44
})

AfterEach(func() {
// Clean up the Unix socket file
socketPath := strings.TrimPrefix(address, "unix://")
if err := os.Remove(socketPath); err != nil && !os.IsNotExist(err) {

Should this rather be done in the server code, before listening?

I suppose the Unix socket is cleaned up by a graceful shutdown, but if a crash prevents the process from removing that file, might it prevent the server from restarting?

@anurag4DSB
Collaborator Author

closing in favor of #83

@anurag4DSB anurag4DSB closed this Jan 2, 2025