[Observability] Open-source systems monitoring and alerting #418

Open
tomsmith8 opened this issue Nov 28, 2024 · 2 comments
tomsmith8 commented Nov 28, 2024

Ticket: Improve Monitoring for Jarvis Endpoint Timeouts

Overview

We are experiencing timeouts on certain Jarvis endpoints. The root cause has not been identified yet. To address this issue, we propose enhancing our monitoring capabilities using additional OSS tools.


Proposed Actions

  1. Investigate Prometheus and Grafana

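    • Prometheus: An open-source monitoring system that scrapes metrics over HTTP, stores them as time series, and supports alerting rules.
    • Grafana: A visualization and dashboarding tool that can query Prometheus as a data source.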
  2. Integrate Node Exporter for Hardware Monitoring

    • Add Prometheus Node Exporter to swarm instances.
    • This will enable detailed hardware monitoring and provide metrics for Prometheus to collect.
  3. Consider Enhancing Superadmin

    • Evaluate the possibility of adding monitoring functionality directly into the Superadmin tool.
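
To make the Node Exporter item more concrete: Node Exporter serves its metrics as plain text over HTTP (on `/metrics`, port 9100 by default), and Prometheus scrapes that text. The sketch below parses a few lines of that format so it is clear what data we would get. The sample values are made up, and the parser is deliberately simplified (it assumes no trailing timestamps); it is only an illustration, not something we would deploy.

```python
from typing import Dict, Tuple

# Illustrative sample in the Prometheus text exposition format.
# The numbers are invented; real output comes from <host>:9100/metrics.
SAMPLE = '''\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1200.5
node_cpu_seconds_total{cpu="0",mode="user"} 300.25
'''


def parse_metrics(text: str) -> Dict[Tuple[str, str], float]:
    """Map (metric name, raw label string) to the sample value.

    Simplification: assumes each sample line ends with the value,
    i.e. no optional timestamp follows it.
    """
    samples: Dict[Tuple[str, str], float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        name, _, labels = series.partition("{")
        samples[(name, labels.rstrip("}"))] = float(value)
    return samples


if __name__ == "__main__":
    for (name, labels), value in parse_metrics(SAMPLE).items():
        print(name, labels, value)
```

In practice Prometheus does this parsing itself; the point is only that each swarm instance would expose counters and gauges like these for collection.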

Next Steps

  • Install and configure Prometheus and Grafana on swarm instances.
  • Integrate Node Exporter for hardware metrics.
  • Set up dashboards in Grafana for endpoint performance and hardware monitoring.
  • Determine feasibility and effort required for extending Superadmin with monitoring features.
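
For the endpoint performance dashboards, we would need a latency and success signal per Jarvis endpoint. A minimal sketch of such a probe follows, assuming a plain HTTP GET is a fair health check. The function names and the 5-second timeout are placeholders I picked for illustration, and no real Jarvis URL is assumed. The Prometheus blackbox exporter covers this kind of probing and is probably the better fit; this only shows what would be measured.

```python
import time
import urllib.request
from typing import Callable, Dict


def http_get(url: str, timeout: float) -> None:
    """Fetch the URL and discard the body; raises OSError on timeout or HTTP error."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def probe(url: str, timeout: float = 5.0,
          fetch: Callable[[str, float], None] = http_get) -> Dict[str, float]:
    """Time a single request and record whether it succeeded.

    `fetch` is injectable so the probe can be exercised without a network.
    """
    start = time.monotonic()
    try:
        fetch(url, timeout)
        up = 1.0
    except OSError:
        # urllib's URLError/HTTPError and socket timeouts all derive from OSError.
        up = 0.0
    return {"up": up, "duration_seconds": time.monotonic() - start}
```

A timeout shows up as `up == 0.0` with `duration_seconds` close to the configured timeout, which is the pattern we would want to graph and alert on.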

Hardware Monitoring (Details)

  • Node Exporter Setup: Install on all swarm instances.
  • Metrics: CPU usage, memory, disk I/O, and network performance.
  • Visualization: Use Grafana to create dashboards for real-time and historical hardware insights.
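
As a note on the CPU usage metric: Node Exporter reports CPU time as cumulative per-mode counters (`node_cpu_seconds_total`), so usage has to be derived from the difference between two samples. A rough sketch of that arithmetic, with an illustrative function name and made-up numbers:

```python
from typing import Dict


def cpu_busy_fraction(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Fraction of CPU time spent in non-idle modes between two samples.

    Each argument maps a CPU mode (idle, user, system, ...) to the
    cumulative seconds reported for that mode. Every mode other than
    "idle" counts as busy in this sketch, including iowait.
    """
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no time elapsed, or a counter reset between samples
    return 1.0 - deltas.get("idle", 0.0) / total


if __name__ == "__main__":
    earlier = {"idle": 100.0, "user": 20.0, "system": 10.0}
    later = {"idle": 160.0, "user": 50.0, "system": 20.0}
    print(cpu_busy_fraction(earlier, later))
```

In Grafana the same calculation would normally be expressed in PromQL with `rate()` over the idle counter, for example `1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))`, rather than computed by hand.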
@tomsmith8
Contributor Author

@gonzaloaune has suggested some open-source solutions to better observe what is going on in swarm. This is especially relevant because we keep seeing timeout issues for whinx on Jarvis endpoints, but it would also help with general monitoring and debugging.

Tagging for thoughts: @Evanfeenstra @kevkevinpal @tobi-bams

@tomsmith8
Contributor Author

@tobi-bams, this could be a next project. I think it would help.
