
[RFD] Export Prometheus metrics for all microservices #66

Open
alexlovelltroy opened this issue Nov 24, 2024 · 2 comments
Labels
rfd Request for Discussion

Comments

@alexlovelltroy
Member

Currently, the OpenChami project lacks a standardized approach to metrics observability across its microservices. Each service emits metrics in an ad-hoc way, if at all. This makes it difficult to:

  • Monitor the overall health of the system.
  • Identify performance bottlenecks or failures.
  • Correlate metrics across services to diagnose issues.
  • Set up alerts for critical system behavior changes.

Example Problem: If one microservice's response time spikes due to a dependency issue, it is challenging to trace this ripple effect through the system without consistent metrics collection.

What do you propose?

Integrate Prometheus metrics collection across all microservices in the OpenChami project. This involves:

  1. Instrumenting each service:

    • Adding Prometheus client libraries to collect and expose metrics.
    • Instrumenting key points in the application (e.g., request latency, error rates, database query durations).
    • Using standardized metric naming conventions for consistency.
  2. Creating instructions for central monitoring setup:

    • Deploying a Prometheus server to scrape and store metrics from all services.
    • Optionally integrating Grafana for visualization and dashboarding.
    • Working with local sites to integrate Prometheus metrics into their existing metrics infrastructure.
  3. Defining SLIs (Service Level Indicators):

    • Establishing metrics to monitor critical aspects of each service, such as uptime, latency, and throughput.
  4. Configuring alerts:

    • Setting up alert rules in Prometheus to notify the team about potential issues (e.g., high error rates or degraded response times).
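For step 4, Prometheus alert rules are YAML fragments loaded via `rule_files`. A hedged example of a high-error-rate alert (the `http_requests_total` metric name and the 5% threshold are illustrative, not agreed SLIs):

```yaml
groups:
  - name: openchami-slis
    rules:
      - alert: HighErrorRate
        # Fraction of 5xx responses per service over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} error rate above 5% for 10 minutes"
```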
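To make step 1 concrete, the following stdlib-only Python sketch shows the text exposition format that each service's /metrics endpoint would serve, along with the kind of standardized naming convention proposed above (counters suffixed with `_total`, base units such as seconds). A real service would use an official Prometheus client library (e.g. client_golang for Go services); the metric and label names here are illustrative, not agreed conventions:

```python
# Minimal sketch of the Prometheus text exposition format.
# Real services should use an official client library; this only
# illustrates what a scrape of /metrics would return.

def render_metrics(metrics):
    """Render (name, help, type, samples) tuples as exposition text."""
    lines = []
    for name, help_text, mtype, samples in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        for labels, value in samples:
            if labels:
                label_str = ",".join(
                    f'{k}="{v}"' for k, v in sorted(labels.items())
                )
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Illustrative metrics for a hypothetical cloud-init service.
metrics = [
    ("http_requests_total",
     "Total HTTP requests handled.",
     "counter",
     [({"service": "cloud-init", "code": "200"}, 1027),
      ({"service": "cloud-init", "code": "500"}, 3)]),
    ("http_request_duration_seconds_sum",
     "Cumulative request latency in seconds.",
     "counter",
     [({"service": "cloud-init"}, 42.7)]),
]

print(render_metrics(metrics))
```

Keeping names and label sets consistent across services is what makes the cross-service correlation described above possible.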

What alternatives exist?

1. Stick to ad-hoc logging and monitoring:

  • Pros:
    • No immediate development effort required.
    • Relies on existing logging frameworks.
  • Cons:
    • Difficult to scale and correlate data across services.
    • Lacks standardization, leading to potential blind spots in monitoring.

2. Custom monitoring solution:

  • Pros:
    • Tailored to OpenChami's specific needs.
  • Cons:
    • Significant development and maintenance effort.
    • Reinventing the wheel when existing tools like Prometheus are proven and widely used.

Other Considerations?

Effort:

  • Adding Prometheus support will require development time, but this can be phased in: start with critical services and expand gradually. See the example for cloud-init.

Security:

  • Exposed metrics endpoints should be secured (e.g., using authentication or limiting access to internal networks).
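As a sketch of the security point, Prometheus scrape configs support TLS and authentication natively; hostnames, ports, and file paths below are placeholders:

```yaml
scrape_configs:
  - job_name: openchami
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/openchami-ca.crt
    # Bearer-token auth; alternatively restrict /metrics to an
    # internal network and scrape over plain HTTP.
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/metrics.token
    static_configs:
      - targets: ["smd.internal:27779", "bss.internal:27778"]
```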

Scalability:

  • Prometheus scales well for most setups, but very large systems may require a clustered solution or additional tools like Thanos or Cortex.

This proposal is aimed at improving observability and system reliability, laying the foundation for proactive system monitoring and rapid incident response.

@alexlovelltroy alexlovelltroy added the rfd Request for Discussion label Nov 24, 2024
@yogi4

yogi4 commented Nov 25, 2024

+1 for Prometheus + Thanos.
Is there any intent to use a Service Mesh (or any L5 tools) and/or eBPF (Cilium) in the future for OpenChami? These components may impact (or drive) how Prometheus (and other OpenTelemetry) tools are incorporated and leveraged.

@alexlovelltroy
Member Author

We've been hesitant to include deployment tools like Kubernetes or a Service Mesh as default components in OpenCHAMI because it limits the usage of the tools in small or isolated deployments. I would certainly support someone creating a deployment recipe that includes them, but it cannot be mandatory.
