Add disk space alerts to loki, prometheus #88

Open
sed-i opened this issue Nov 21, 2023 · 0 comments

sed-i commented Nov 21, 2023

Bug Description

When Loki fills up the disk, the COS charms fail.
The output of juju status may then look like this:

Model               Controller  Cloud/Region        Version  SLA          Timestamp
cos-lite-load-test  uk8s        microk8s/localhost  3.1.6    unsupported  03:26:19Z

App            Version  Status   Scale  Charm                         Channel  Rev  Address         Exposed  Message
alertmanager   0.25.0   active       1  alertmanager-k8s              edge      96  10.152.183.245  no       
catalogue               active       1  catalogue-k8s                 edge      31  10.152.183.95   no       
cos-config     3.5.0    active       1  cos-configuration-k8s         edge      39  10.152.183.106  no       
grafana        9.2.1    active       1  grafana-k8s                   edge      93  10.152.183.29   no       
loki           2.8.4    waiting    0/1  loki-k8s                      edge     104  10.152.183.180  no       waiting for units to settle down
prometheus     2.47.2   waiting    0/1  prometheus-k8s                edge     156  10.152.183.54   no       waiting for units to settle down
scrape-config  n/a      active       1  prometheus-scrape-config-k8s  edge      44  10.152.183.33   no       
scrape-target  n/a      active       1  prometheus-scrape-target-k8s  edge      31  10.152.183.159  no       
traefik                 waiting    0/1  traefik-k8s                   edge     164  10.128.0.6      no       waiting for units to settle down

Unit              Workload  Agent  Address       Ports  Message
alertmanager/0*   active    idle   10.1.174.152         
catalogue/0*      active    idle   10.1.174.140         
cos-config/0*     active    idle   10.1.174.144         
grafana/0*        active    idle   10.1.174.186         
loki/0            unknown   lost   10.1.174.171         agent lost, see 'juju show-status-log loki/0'
prometheus/0      unknown   lost   10.1.174.160         agent lost, see 'juju show-status-log prometheus/0'
scrape-config/0*  active    idle   10.1.174.147         
scrape-target/0*  active    idle   10.1.174.150         
traefik/0         unknown   lost   10.1.174.184         agent lost, see 'juju show-status-log traefik/0'

The juju debug-log output doesn't show anything obvious either.

It would be handy for the loki and prometheus charms to ship baked-in disk space alerts (e.g. rules based on predict_linear).
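
As a rough illustration of what such a rule could look like, here is a minimal sketch of a Prometheus alert rule built on predict_linear. It is not taken from any charm. The group name, alert name, 6h lookback window, 24h horizon, and severity are all hypothetical. It also assumes the node-exporter metric node_filesystem_avail_bytes is being scraped, which may not hold in a given COS deployment. A Kubernetes setup might need a volume-level metric instead.

```yaml
groups:
  - name: disk-space-sketch            # hypothetical group name
    rules:
      - alert: FilesystemPredictedFull # hypothetical alert name
        # Assumption: node_filesystem_avail_bytes comes from node-exporter.
        # Fit a line through the last 6h of available bytes and extrapolate
        # 24h ahead; fire if the projected value drops below zero.
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) < 0
        for: 30m                       # require the prediction to hold for a while
        labels:
          severity: warning
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill up within 24 hours"
```

A rule of this shape should fire well before the 99% usage seen in the df output below, while the units are still reachable.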

To Reproduce

Keep Loki running until disk space is exhausted.

Environment

Not limited to a particular env.

Relevant log output

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        97G   96G  1.4G  99% /
tmpfs           7.9G     0  7.9G   0% /dev/shm
tmpfs           3.2G   11M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
tmpfs           1.6G  4.0K  1.6G   1% /run/user/1000

Additional context

No response
