Currently, the OpenCHAMI project lacks a standardized approach to metrics observability across its microservices. Each service emits metrics in an ad-hoc way, if at all. This makes it difficult to:
Monitor the overall health of the system.
Identify performance bottlenecks or failures.
Correlate metrics across services to diagnose issues.
Set up alerts for critical system behavior changes.
Example Problem: If one microservice's response time spikes due to a dependency issue, it is challenging to trace this ripple effect through the system without consistent metrics collection.
What do you propose?
Integrate Prometheus metrics collection across all microservices in the OpenCHAMI project. This involves:
Instrumenting each service:
Adding Prometheus client libraries to collect and expose metrics.
Instrumenting key points in the application (e.g., request latency, error rates, database query durations); see the sketch after this list.
Using standardized metric naming conventions for consistency.
Creating instructions for central monitoring setup:
Deploying a Prometheus server to scrape and store metrics from all services.
Optionally integrating Grafana for visualization and dashboarding.
Working with local sites to feed Prometheus metrics into their existing metrics plant.
Defining SLIs (Service Level Indicators):
Establishing metrics to monitor critical aspects of each service, such as uptime, latency, and throughput.
Configuring alerts:
Setting up alert rules in Prometheus to notify the team about potential issues (e.g., high error rates or degraded response times).
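To make the instrumentation step concrete, below is a minimal sketch of a Go service instrumented with the prometheus/client_golang library. The metric names, the `openchami_example_` prefix, the route, and the port are illustrative assumptions rather than existing OpenCHAMI identifiers.

```go
package main

import (
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metrics follow a "<service>_<subsystem>_<unit>" naming scheme; the
// "openchami_example" prefix here is only a placeholder.
var (
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "openchami_example_http_request_duration_seconds",
			Help:    "HTTP request latency by route and status code.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"route", "code"},
	)
	requestErrors = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "openchami_example_http_request_errors_total",
			Help: "Count of HTTP requests that returned a 5xx status.",
		},
		[]string{"route"},
	)
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

// instrument wraps an http.Handler, recording latency and error counts.
func instrument(route string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)

		requestDuration.WithLabelValues(route, strconv.Itoa(rec.status)).
			Observe(time.Since(start).Seconds())
		if rec.status >= 500 {
			requestErrors.WithLabelValues(route).Inc()
		}
	})
}

func main() {
	prometheus.MustRegister(requestDuration, requestErrors)

	mux := http.NewServeMux()
	mux.Handle("/healthz", instrument("/healthz", http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok")) })))
	// Expose the metrics endpoint for Prometheus to scrape.
	mux.Handle("/metrics", promhttp.Handler())

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```

Prometheus would then scrape the service's /metrics endpoint on its configured scrape interval; the histogram buckets above are the client library defaults and can be tuned per service.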
What alternatives exist?
1. Stick to ad-hoc logging and monitoring:
Pros:
No immediate development effort required.
Relies on existing logging frameworks.
Cons:
Difficult to scale and correlate data across services.
Lacks standardization, leading to potential blind spots in monitoring.
2. Custom monitoring solution:
Pros:
Tailored to OpenCHAMI's specific needs.
Cons:
Significant development and maintenance effort.
Reinventing the wheel when existing tools like Prometheus are proven and widely used.
Other Considerations?
Effort:
Adding Prometheus support will require development time, but it can be phased in: start with critical services and expand gradually (see the cloud-init example).
Security:
Exposed metrics endpoints should be secured (e.g., using authentication or limiting access to internal networks); a minimal sketch follows this section.
Scalability:
Prometheus scales well for most setups, but very large systems may require a clustered solution or additional tools like Thanos or Cortex.
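On the security point above, here is one possible way to protect the /metrics endpoint in a Go service. The credentials and the loopback bind address are placeholders; sites may prefer to rely on network-level restrictions or an existing auth proxy instead.

```go
package main

import (
	"crypto/subtle"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// basicAuth guards a handler with a single static credential. The
// user/password values are placeholders; in practice they would come
// from the service's existing configuration or secrets store.
func basicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok ||
			subtle.ConstantTimeCompare([]byte(u), []byte(user)) != 1 ||
			subtle.ConstantTimeCompare([]byte(p), []byte(pass)) != 1 {
			w.Header().Set("WWW-Authenticate", `Basic realm="metrics"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", basicAuth("prometheus", "change-me", promhttp.Handler()))

	// Binding to a loopback or management interface keeps the endpoint
	// off the public network even without authentication.
	log.Fatal(http.ListenAndServe("127.0.0.1:9090", mux))
}
```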
This proposal is aimed at improving observability and system reliability, laying the foundation for proactive system monitoring and rapid incident response.
+1 for Prometheus + Thanos.
Is there any intent to use a service mesh (or other L5 tools) and/or eBPF (Cilium) in the future for OpenCHAMI? These components may impact (or drive) how Prometheus (or any OpenTelemetry) tools are incorporated and leveraged.
We've been hesitant to include deployment tools like Kubernetes or a Service Mesh as default components in OpenCHAMI because it limits the usage of the tools in small or isolated deployments. I would certainly support someone creating a deployment recipe that includes them, but it cannot be mandatory.