Production-ready monitoring cluster based on the Grafana Stack, Hetzner, and Cloudflare.
Welcome to the Askrella Sauron project! This project provisions a fully functional, production-ready cluster that includes essential monitoring and logging tools such as Grafana, Prometheus, Thanos, Loki, Tempo, Node-Exporter, and Promtail. The infrastructure is hosted on Hetzner Cloud, utilizing IPv6 for cost efficiency and enhanced performance.
- Overview
- Infrastructure Diagram
- Components
- Network and Security
- Prerequisites
- Deployment
- Infrastructure Provisioning Steps
- Example Pricing
- FAQ
- TODO
- Contributing
- License
- Maintainers
This project is part of the Askrella company and aims to provide a robust and scalable monitoring solution. The cluster may be deployed across multiple Hetzner regions, ensuring high availability and failover capabilities. The infrastructure leverages Hetzner's private subnet and firewall features, with routing managed by Caddy instances on each node. Load balancing and failover are handled by Cloudflare LoadBalancer, and IP addresses are published via Cloudflare DNS records.
- Grafana: A powerful visualization and analytics tool.
- Prometheus: A monitoring system and time series database.
- Grafana Loki: A log aggregation system.
- Grafana Tempo: A distributed tracing backend.
- Node-Exporter: An exporter for hardware and OS metrics.
- Promtail: A log collector that ships logs to Loki.
- Thanos: A highly available system for querying metrics across multiple nodes & clusters.
- Hetzner Object Store: An S3-compatible object storage service used as a backend for storing database data.
The following dashboards are included by default in the Grafana setup:
- cAdvisor Dashboard: Provides insights into container resource usage and performance metrics.
- Node Exporter Dashboard: Displays hardware and OS metrics collected from the node.
These dashboards are pre-configured and can be accessed through the Grafana interface once the cluster is deployed.
- Private Subnet: All servers are hosted in a private Hetzner subnet, enhancing security and reducing costs.
- IPv6: Utilized for cost efficiency and modern networking capabilities.
- Firewall: Configured to allow only necessary traffic, ensuring secure communication.
- Caddy: Deployed on each node for efficient routing and TLS termination.
- Load Balancing: Hetzner LoadBalancer provides load balancing and failover capabilities.
- DNS: IP addresses are published via Cloudflare DNS records for easy access.
Before you begin deploying the Askrella Sauron cluster, ensure you have the following prerequisites in place:
- Hetzner Account: Create an account on Hetzner Cloud and obtain an API token for provisioning resources.
- Cloudflare Account: Set up a Cloudflare account to manage DNS records and load balancing.
- Google OAuth Credentials: Set up OAuth 2.0 credentials in Google Cloud Console for Grafana authentication.
- SSH Key: Generate an SSH key pair for secure access to the servers.
  ```bash
  # Generate SSH key
  ssh-keygen -t ed25519 -C "[email protected]"
  ```
- Operating System: A Unix-based system (Linux or macOS) is recommended for running the deployment scripts.
- Hetzner API Token: Create an API token on Hetzner Cloud.
- Cloudflare API Token: Create an API token on Cloudflare (a quick way to verify the token is shown after this list). The token needs to have the following permissions for the domain you want to use:
- Zone DNS: Read, Edit
- Zone Settings: Read, Edit
- Zone Load Balancing: Read, Edit
- Health Checks: Read, Edit
- Origin Rules: Read, Edit
- Account: Load Balancing: Account Load Balancing: Read, Edit
- Account: Load Balancing: Monitors And Pools: Read, Edit
- Enable Cloudflare LoadBalancer: Enable the LoadBalancer in the Cloudflare dashboard by going to your domain, clicking on "Traffic" and then "Load Balancing". If you don't do this, the script will fail with an interval validation error for the monitor (range [0,0]). See Example Pricing for more details.
- Google OAuth Client ID and Secret: Create OAuth credentials for secure authentication. This may be set to internal use only.
- Git
- Terraform v1.10.3 or newer
- IPv6 Support: Ensure your network supports IPv6, as the infrastructure leverages IPv6 for cost efficiency.
- Firewall Configuration: Allow necessary ports for SSH, HTTP, and HTTPS traffic.
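As a quick sanity check for the Cloudflare API token, you can call Cloudflare's token verification endpoint before running Terraform. This is only a sketch; `CLOUDFLARE_API_TOKEN` below is a placeholder environment variable holding the token you created.

```bash
# Verify that the Cloudflare API token is valid and active.
# CLOUDFLARE_API_TOKEN is a placeholder for the token created above.
curl -s -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  https://api.cloudflare.com/client/v4/user/tokens/verify
```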
Once you have all the prerequisites in place, you can proceed with the deployment steps outlined in the Deployment section.
To deploy the cluster, follow these steps:
1. Clone the Repository:

   ```bash
   git clone https://github.com/askrella/sauron.git
   cd sauron
   ```
2. Configure Terraform Variables: Update the `terraform.tfvars` file according to this structure:
   ```hcl
   hcloud_token = ""
   minio_bucket = "sauron-bucket"
   minio_user = "your-user"
   minio_password = "your-password"
   minio_region = "nbg1"
   # We recommend starting with 3 nodes and scaling up if you need more.
   # Having fewer nodes will work, but limits the failover and scaling capabilities of your cluster.
   cluster_size = 3
   base_domain = "example.com"
   domain = "monitoring.example.com"
   otel_collector_username = "otel"
   # Make sure to escape the BCrypt hashed password
   otel_collector_password = "$${2y}$$10$$PJfizyuW5JdSVlLHsRq8O.tMxrqOcpR0jtz0PXo5u3fRA8Ue.YL.C"
   cloudflare_api_token = ""
   cloudflare_account_id = ""
   grafana_admin_password = ""
   gf_server_root_url = "https://grafana.example.com"
   gf_auth_google_client_id = "your-id.apps.googleusercontent.com"
   gf_auth_google_client_secret = "your-secret"
   gf_auth_google_allowed_domains = "example.com"
   ```
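   The BCrypt hash for `otel_collector_password` can be generated with any bcrypt-capable tool. As a sketch (assuming `htpasswd` from the apache2-utils / httpd-tools package is available), you could use:

   ```bash
   # Generate a bcrypt hash for the OTel collector password (cost factor 10);
   # strip the leading "otel:" from the output and escape the "$" characters
   # following the pattern shown in the example value above.
   htpasswd -nbBC 10 otel 'your-password'
   ```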
3. Initialize Terraform:

   ```bash
   terraform init
   ```
4. Apply the Terraform Plan:

   ```bash
   terraform plan -out=tfplan
   terraform apply tfplan
   ```
5. Verify Deployment: Ensure all services are running and accessible via the configured DNS records.

   5.1 Connect to Grafana and log in with the credentials you provided in the `terraform.tfvars` file. Check the dashboards and logs to make sure everything is working.

   5.2 Connect to the OTel Collector and log in with the credentials you provided in the `terraform.tfvars` file. You can connect your application to the OTel Collector using the following endpoints:

   - HTTP endpoint: http://example.com:2053 (basic auth with the credentials from `terraform.tfvars`)
   - gRPC endpoint: http://example.com:2083 (basic auth with the credentials from `terraform.tfvars`)
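   For applications using an OpenTelemetry SDK, these endpoints can typically be supplied through the standard OTLP environment variables. The snippet below is only a sketch; `example.com` and the credentials are placeholders for the values from your `terraform.tfvars`:

   ```bash
   # Point an OpenTelemetry SDK at the collector's HTTP endpoint (placeholder values).
   export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
   export OTEL_EXPORTER_OTLP_ENDPOINT="http://example.com:2053"
   # Basic auth header built from the otel_collector_username / password pair
   # (some SDKs require the space after "Basic" to be percent-encoded as %20).
   export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic $(echo -n 'otel:your-password' | base64)"
   ```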
Please make sure to persist the Terraform state files contained in the `terraform/cluster` and `terraform` directories.
The number of nodes you should provision depends on your use case. We recommend starting with 3 nodes and scaling up if you need more. When running on fewer than 3 nodes, the replication factor (e.g. for Loki) will be set to the number of nodes, which is generally not recommended because a quorum cannot be reliably reached.
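As a rough illustration (the exact behavior depends on each component's configuration): with 3 nodes and a replication factor of 3, a write only needs acknowledgement from a quorum of floor(3/2) + 1 = 2 replicas, so a single node can fail without interrupting ingestion; with only 1 or 2 nodes, losing a single node can already stall writes.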
The infrastructure setup is a multi-step process that ensures a robust and secure environment for your applications. Below is a detailed breakdown of the steps involved:
1. Provision Hetzner Infrastructure
1.1 SSH Key: Generate and configure SSH keys for secure access to the servers.
1.2 Private Network: Set up a private network within Hetzner to ensure secure communication between nodes.
1.3 Firewall: Configure firewall rules to allow only necessary traffic, enhancing security.
1.4 LoadBalancer: Deploy a Hetzner LoadBalancer to distribute traffic evenly across the nodes and provide failover capabilities.
1.5 Servers: Provision the required number of servers in the specified Hetzner regions.
1.6 Object Store: Set up Hetzner Object Store as an S3-compatible backend for storing database data.
1.7 SSH Connection Check: Verify SSH connectivity to each server to ensure they are accessible and ready for configuration.
2. Node Setup for Each Server via Terraform Invoke
2.1 Docker Install: Install Docker on each server to facilitate containerized application deployment.
2.2 Configuration Transfer: Transfer necessary configuration files to each server.
2.3 Container Creation & Startup: Create and start Docker containers for Grafana, Prometheus, Loki, Tempo, Node-Exporter, and Promtail.
3. Cloudflare DNS
3.1 DNS Records: Configure Cloudflare DNS records to publish the IP addresses of the servers:
- Create AAAA records to map hostnames to IPv6 addresses
- Verify DNS propagation after record creation (see the lookup example below)
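One simple way to check the published records after the apply has finished is a plain DNS lookup; `monitoring.example.com` below is a placeholder for the domain you configured:

```bash
# Query the AAAA record(s) created by the deployment (placeholder domain).
dig AAAA monitoring.example.com +short
```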
Here's an example of the monthly costs for running a cluster with three CAX11 nodes and a Cloudflare LoadBalancer:
- Hetzner CAX11 Nodes:
  - 3 nodes x 3.29€ per node = 9.87€ per month
- Hetzner Object Store:
  - 1TB Storage & 1TB Traffic = 5€ per month
- Cloudflare LoadBalancer:
  - 3 nodes = $10.00 per month ($5 base fee + $5 for the 3rd node)
Note: The pricing is based on the current Hetzner and Cloudflare pricing at the time of writing this README (December 2024). You should also consider traffic costs between the regions since it is treated as outbound traffic. Please refer to the latest pricing information at:
- Hetzner Cloud Pricing: https://www.hetzner.com/cloud
- Hetzner Object Storage Pricing: https://www.hetzner.com/storage/object-storage/
- Cloudflare Load Balancer Pricing: https://www.cloudflare.com/plans/load-balancer-pricing/
Total Estimated Cost: ~25€ per month
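(As a rough breakdown: 9.87€ for the nodes + 5€ for the object store + the $10 Cloudflare LoadBalancer fee, which is roughly 9.5€ assuming an exchange rate of about $1.05 per euro, adds up to around 24-25€ per month.)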
We initially hosted the monitoring cluster on our own servers but encountered significant storage wear issues:
- In just 7 days of monitoring ~17 containers + hosts, we experienced:
- 26% wear on high-grade NVMe SSDs
- ~1.73PB of data written
- ~1.22PB of data read
This excessive I/O load led us to switch to managed VMs, which provided several benefits:
- Eliminated concerns about hardware wear and maintenance
- Improved availability through multi-datacenter deployment
- Reduced operational overhead
- Cost-effective scaling
We previously used Elastic Cloud and ran into significant issues when we ran out of storage (driven by rapid business growth), while also incurring high costs due to the amount of traffic we were generating.
Ensuring SLOs required us to monitor hundreds of services, websites, databases and tools. When simplifying the monitoring setup, we chose Elastic Synthetic Monitors and ended up with a bill higher than the cost of the cluster itself. This led us to set up dedicated locations for the agents, which reduced the cost but increased the complexity of the setup.
Sauron now provides a simple way to monitor without the need to set up dedicated agent locations, reducing both cost and complexity.
Sauron decreased our overall costs by 65% and provided a more reliable and scalable solution due to the increased number of nodes.
Despite its large set of features and integrations, our previous Elastic Cloud cluster could not keep up with maintaining our indices and queries. We usually had to wait 30+ seconds for results to be returned. With Sauron we can now query the data from far more advanced dashboards and get results in just a few seconds.
You can monitor the creation and startup of the containers by running the following command on the server:

```bash
docker events --filter event=create --filter event=start --filter event=mount
```
You can connect to a server using the SSH key provided in the terraform directory.
Example:

```bash
ssh -i terraform/id_ed25519 -o "StrictHostKeyChecking=no" -o "UserKnownHostsFile=/dev/null" root@2a02:1e8:c012:ddd1::1
```
- Autobase allows us to provision a production-ready Postgres cluster and provides a simple UI too.
- Tests
- First-time deployment
- Re-deploy
- Adding nodes
- Removing nodes
- Upgrading from main to new version (checkout stable main and change to new version)
- Encryption between nodes in internal network
- Uptime monitoring using chromium based agent
- Alerting
- Grafana shared database
- OnCall integration
- We are running all components on each node. This means:
  - the containers may affect each other's performance,
  - we don't benefit from scaling each component independently of the others,
  - we cannot scale indefinitely, since we are limited by the overhead of all the "monolithic" nodes communicating with every other node.
  - However: the load is distributed, since each Prometheus instance only manages a portion of recent metrics. This allows us to scale horizontally up to a reasonable number of instances, with the per-node load shrinking roughly in inverse proportion to the number of instances, while maintaining high availability. Additionally, Thanos provides long-term storage and querying capabilities for historical metrics spanning multiple years. We can also tolerate peaks better than a single instance of e.g. Prometheus, since data ingestion is distributed across multiple nodes.
- We are affected by kreuzwerker/terraform-provider-docker#648, a known issue with the Docker provider that leads to re-deployment of all containers when `networks_advanced` is used. This behavior unfortunately also leads to downtime of the cluster.
We welcome contributions from the community!
This project is licensed under the MIT License. See the LICENSE file for more details.
Askrella - Consulting is more than giving advice
- Steve - steve-hb - [email protected]
For any questions or support, please contact the Askrella team at [email protected].
Happy Monitoring!