
[RFD] Export Prometheus metrics for all microservices #66

Open
alexlovelltroy opened this issue Nov 24, 2024 · 2 comments
Labels
rfd Request for Discussion

Comments

@alexlovelltroy
Member

Currently, the OpenChami project lacks a standardized approach to metrics observability across its microservices. Each service emits metrics in an ad-hoc way, if at all. This makes it difficult to:

  • Monitor the overall health of the system.
  • Identify performance bottlenecks or failures.
  • Correlate metrics across services to diagnose issues.
  • Set up alerts for critical system behavior changes.

Example Problem: If one microservice's response time spikes due to a dependency issue, it is challenging to trace this ripple effect through the system without consistent metrics collection.

What do you propose?

Integrate Prometheus metrics collection across all microservices in the OpenChami project. This involves:

  1. Instrumenting each service:

    • Adding Prometheus client libraries to collect and expose metrics.
    • Instrumenting key points in the application (e.g., request latency, error rates, database query durations).
    • Using standardized metric naming conventions for consistency.
  2. Creating instructions for central monitoring setup:

    • Deploying a Prometheus server to scrape and store metrics from all services.
    • Optionally integrating Grafana for visualization and dashboarding.
    • Working with local sites to integrate Prometheus metrics into their existing metrics infrastructure.
  3. Defining SLIs (Service Level Indicators):

    • Establishing metrics to monitor critical aspects of each service, such as uptime, latency, and throughput.
  4. Configuring alerts:

    • Setting up alert rules in Prometheus to notify the team about potential issues (e.g., high error rates or degraded response times).
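For step 4, Prometheus alert rules are YAML fragments loaded via `rule_files`. A hedged example of a high-error-rate alert (the `http_requests_total` metric name and the 5% threshold are illustrative, not agreed SLIs):

```yaml
groups:
  - name: openchami-slis
    rules:
      - alert: HighErrorRate
        # Fraction of 5xx responses per service over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} error rate above 5% for 10 minutes"
```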
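To make step 1 concrete, the following stdlib-only Python sketch shows the text exposition format that each service's /metrics endpoint would serve, along with the kind of standardized naming convention proposed above (counters suffixed with `_total`, base units such as seconds). A real service would use an official Prometheus client library (e.g. client_golang for Go services); the metric and label names here are illustrative, not agreed conventions:

```python
# Minimal sketch of the Prometheus text exposition format.
# Real services should use an official client library; this only
# illustrates what a scrape of /metrics would return.

def render_metrics(metrics):
    """Render (name, help, type, samples) tuples as exposition text."""
    lines = []
    for name, help_text, mtype, samples in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        for labels, value in samples:
            if labels:
                label_str = ",".join(
                    f'{k}="{v}"' for k, v in sorted(labels.items())
                )
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Illustrative metrics for a hypothetical cloud-init service.
metrics = [
    ("http_requests_total",
     "Total HTTP requests handled.",
     "counter",
     [({"service": "cloud-init", "code": "200"}, 1027),
      ({"service": "cloud-init", "code": "500"}, 3)]),
    ("http_request_duration_seconds_sum",
     "Cumulative request latency in seconds.",
     "counter",
     [({"service": "cloud-init"}, 42.7)]),
]

print(render_metrics(metrics))
```

Keeping names and label sets consistent across services is what makes the cross-service correlation described above possible.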

What alternatives exist?

1. Stick to ad-hoc logging and monitoring:

  • Pros:
    • No immediate development effort required.
    • Relies on existing logging frameworks.
  • Cons:
    • Difficult to scale and correlate data across services.
    • Lacks standardization, leading to potential blind spots in monitoring.

2. Custom monitoring solution:

  • Pros:
    • Tailored to OpenChami's specific needs.
  • Cons:
    • Significant development and maintenance effort.
    • Reinventing the wheel when existing tools like Prometheus are proven and widely used.

Other Considerations?

Effort:

  • Adding Prometheus support will require development time, but this can be phased in: start with critical services and expand gradually. See the example for cloud-init.

Security:

  • Exposed metrics endpoints should be secured (e.g., using authentication or limiting access to internal networks).
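As a sketch of the security point, Prometheus scrape configs support TLS and authentication natively; hostnames, ports, and file paths below are placeholders:

```yaml
scrape_configs:
  - job_name: openchami
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/openchami-ca.crt
    # Bearer-token auth; alternatively restrict /metrics to an
    # internal network and scrape over plain HTTP.
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/metrics.token
    static_configs:
      - targets: ["smd.internal:27779", "bss.internal:27778"]
```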

Scalability:

  • Prometheus scales well for most setups, but very large systems may require a clustered solution or additional tools like Thanos or Cortex.

This proposal is aimed at improving observability and system reliability, laying the foundation for proactive system monitoring and rapid incident response.

@alexlovelltroy alexlovelltroy added the rfd Request for Discussion label Nov 24, 2024
@yogi4

yogi4 commented Nov 25, 2024

+1 for Prometheus + Thanos.
Is there any intent to use a Service Mesh (or any L5 tools) and/or eBPF (Cilium) in the future for OpenChami? These components may impact (or drive) how Prometheus (and other OpenTelemetry) tools are incorporated and leveraged.

@alexlovelltroy
Member Author

We've been hesitant to include deployment tools like Kubernetes or a Service Mesh as default components in OpenCHAMI because it limits the usage of the tools in small or isolated deployments. I would certainly support someone creating a deployment recipe that includes them, but it cannot be mandatory.
