Latest/edge is not collecting logs for OpenSearch snap #226

Closed
gabrielcocenza opened this issue Dec 20, 2024 · 9 comments

@gabrielcocenza
Member

gabrielcocenza commented Dec 20, 2024

Bug Description

Playing around with OpenSearch and grafana-agent, I discovered that latest/edge is not forwarding logs to Loki, while latest/stable works as expected.
I'm not sure whether it's a misconfiguration on the grafana-agent side or on the OpenSearch side.

Checking the difference in /etc/grafana-agent.yaml between stable (1.yaml) and edge (2.yaml):

diff 1.yaml 2.yaml
32a33
>     - drm
66c67
<     rootfs_path: /var/lib/snapd/hostfs
---
>     rootfs_path: /
98a100
>           __path_exclude__: ''
121c123
<     - job_name: opensearch-common-var-log-opensearch
---
>     - job_name: opensearch-var-log-opensearch
124a127,130
>       - structured_metadata:
>           filename: filename
>       - labeldrop:
>         - filename
126c132
<       - replacement: /opensearch
---
>       - replacement: /var/log/opensearch
132,133c138,139
<           __path__: /snap/grafana-agent/51/shared-logs/opensearch/**
<           job: opensearch-common-var-log-opensearch
---
>           __path__: /var/snap/opensearch/common/var/log/opensearch/**
>           job: opensearch-var-log-opensearch
211a218
> traces: {}
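
Pulling the OpenSearch hunks together, the edge-side scrape block would look roughly like this (a hand-assembled sketch from the diff above, not the exact rendered file; note that the __path_exclude__: '' hunk lands in an earlier scrape job, around line 100 of the file):

- job_name: opensearch-var-log-opensearch
  pipeline_stages:
    # filename moves into structured metadata and is dropped as a stream label
    - structured_metadata:
        filename: filename
    - labeldrop:
        - filename
  relabel_configs:
    - replacement: /var/log/opensearch
      # (other relabel fields not shown in the diff)
  static_configs:
    - labels:
        __path__: /var/snap/opensearch/common/var/log/opensearch/**
        job: opensearch-var-log-opensearch
      # (targets and remaining keys not shown in the diff)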

The log folder is protected with 770 permissions, so I'm not sure whether grafana-agent will be able to read from it:

ls -l /var/snap/opensearch/common/var/log
total 12
drwxrwx--- 2 snap_daemon root 12288 Dec 20 12:36 opensearch
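
A quick way to sanity-check the permissions along that path (just a sketch):

# show owner and permissions of every component of the path
sudo namei -l /var/snap/opensearch/common/var/log/opensearch
# try reading the directory as root (snap daemons run as root by default)
sudo ls -l /var/snap/opensearch/common/var/log/opensearch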

To Reproduce

1 - Deploy COS-lite, then deploy OpenSearch and grafana-agent on stable
2 - See that snap logs appear in Grafana
3 - Refresh grafana-agent to edge
4 - Logs are not propagated anymore

Environment

I used an LXD cloud for OpenSearch and MicroK8s for COS.
Applications were using base 22.04.

gabrielcocenza changed the title from "Latest/edge is not collecting logs from OpenSearch snap" to "Latest/edge is not collecting logs for OpenSearch snap" on Dec 20, 2024
@gabrielcocenza
Member Author

Another thing I'm hitting is that if I try to go back to latest/stable, I encounter this error:

juju refresh grafana-agent --channel=latest/stable --force --force-units
Added charm-hub charm "grafana-agent", revision 223 in channel latest/stable, to the model
ERROR cannot downgrade from v2 charm format to v1

@sed-i
Contributor

sed-i commented Dec 20, 2024

Thanks! Moved to top of backlog.

I wonder if __path_exclude__: '' matches everything. @lucabello?

The v2 vs v1 error is about the metadata schema, because we switched to a consolidated charmcraft.yaml. We didn't realize at the time that this change is "irreversible".
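
(For anyone unfamiliar with the consolidated format: metadata that used to live in metadata.yaml, config.yaml, and actions.yaml is now declared directly in charmcraft.yaml. A rough, illustrative sketch, not the actual grafana-agent charm files:)

# charmcraft.yaml, consolidated layout (illustrative only)
name: grafana-agent
type: charm
summary: ...
description: ...
requires: {}   # relation definitions that used to live in metadata.yaml
config:
  options: {}  # options that used to live in config.yaml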

@lucabello
Contributor

I'm looking into this and trying to reproduce. If the empty value matches everything, we can simply omit the __path_exclude__ label when it's empty, so the fix should be relatively easy.
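
In config terms, the idea would be to render the labels block without the key when there is nothing to exclude (a sketch, not the actual rendered file):

labels:
  __path__: <existing path glob, unchanged>
  # __path_exclude__ simply not rendered when empty, instead of being set to ''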

@lucabello
Contributor

lucabello commented Jan 9, 2025

I tried to reproduce this and wasn't able to. @gabrielcocenza could you provide a juju export-bundle of your deployment / share the grafana-agent logs so I can try to reproduce with that?

To share the agent logs, you can juju ssh into the machine and send the result of snap logs grafana-agent.
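
For example (the unit name here is just a placeholder):

juju export-bundle > bundle.yaml
juju ssh grafana-agent/0 'sudo snap logs grafana-agent -n=all'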

@gabrielcocenza
Member Author

gabrielcocenza commented Jan 9, 2025

@lucabello

This is the bundle: https://pastebin.canonical.com/p/C2xqFD9Yt4/

The logs are: https://pastebin.canonical.com/p/BCXc5PQ4HC/

Edit: the bundle uses grafana-agent from stable; the error shows up after refreshing to edge.

@lucabello
Contributor

I tried reproducing this with Zookeeper, even refreshing from stable to edge (same revisions as in the reproducer), and couldn't.

This leads me to believe the issue is in the integration with OpenSearch.

Testing __path_exclude__ and /var/log logs

opensearch is blocked because vm.swappiness should be at most 1, so it's not producing any logs of its own; however, if __path_exclude__ is the issue, we can verify it by checking whether the contents of /var/log make it to Loki.
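
(For the record, clearing that block is just a kernel setting on the OpenSearch machine; a one-line sketch, in case opensearch needs to write its own logs:)

sudo sysctl -w vm.swappiness=1   # persist via /etc/sysctl.d/ if needed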

I tried to reproduce with the attached bundle and noticed that grafana-agent from stable uses the strictly-confined snap, as you can see from snap info grafana-agent on the machine:

tracking:     latest/stable
refresh-date: today at 09:01 UTC
hold:         forever
channels:
  latest/stable:          0.40.4 2024-11-14  (84) 299MB -
  latest/candidate:       ↑
  latest/beta:            ↑
  latest/edge:            0.40.4 2024-12-13 (106) 299MB -
  0.40-classic/stable:    0.40.4 2024-08-14  (82) 229MB classic
  0.40-classic/candidate: ↑
  0.40-classic/beta:      ↑
  0.40-classic/edge:      0.40.4 2024-12-13 (107) 299MB classic
installed:                0.40.4             (51) 299MB held

I tried to add a new log file to see if it's picked up:

ubuntu@juju-2f9ff5-2:~$ echo "First test!" | sudo tee /var/log/01-test.log
First test!

[screenshot: Screenshot-2025-01-10_10-15]

Indeed, everything seems to be making it to Loki with the Grafana Agent from stable.

Refreshing the charm to edge switches the snap to classic confinement by default.

installed:                0.40.4             (95) 299MB classic,held

I tried to add a new log file again:

ubuntu@juju-2f9ff5-2:~$ echo "Second test!" | sudo tee /var/log/02-test.log
Second test!

[screenshot: Screenshot-2025-01-10_10-21]

The logs are still working, and new files are being picked up.

In fact, even your grafana-agent logs suggest that: the agent is watching a new directory and found a new file (tail routine: started, path=/var/log/landscape/sysinfo.log):

2025-01-09T20:50:19Z grafana-agent.grafana-agent[88349]: ts=2025-01-09T20:50:19.910305034Z caller=filetarget.go:313 level=info component=logs logs_config=log_file_scraper msg="watching new directory" directory=/var/log/landscape
2025-01-09T20:50:19Z grafana-agent.grafana-agent[88349]: ts=2025-01-09T20:50:19.910485268Z caller=tailer.go:147 level=info component=logs logs_config=log_file_scraper component=tailer msg="tail routine: started" path=/var/log/landscape/sysinfo.log
2025-01-09T20:50:19Z grafana-agent.grafana-agent[88349]: ts=2025-01-09T20:50:19.910548862Z caller=log.go:168 component=logs logs_config=log_file_scraper level=info msg="Seeked /var/log/landscape/sysinfo.log - &{Offset:0 Whence:0}"

Testing /var/snap/opensearch

I did a similar test to see if logs from opensearch would be picked up, and everything seems to be working correctly:

ubuntu@juju-2f9ff5-0:~$ echo "Test three, go!" | sudo tee /var/snap/opensearch/common/var/log/opensearch/03-test.log
Test three, go!

[screenshot: Screenshot-2025-01-10_10-35]

@gabrielcocenza
Member Author

gabrielcocenza commented Jan 10, 2025

@lucabello the query you shared does work, and I can see the logs.

The query that I was doing before was:

[screenshot: the previous query]

Then after changing to latest/edge, I couldn't find it anymore:
[screenshot: the same query, now returning nothing]

So I cannot find a file like /snap/grafana-agent/51/shared-logs/opensearch/opensearch-ys4x.log after using edge, which led me to think that logs were not being propagated.

Is this expected behavior, or was I doing a weird query?

@sed-i
Contributor

sed-i commented Jan 10, 2025

I haven't looked into this, but perhaps it's related to the diff you posted?

-           __path__: /snap/grafana-agent/51/shared-logs/opensearch/**
+           __path__: /var/snap/opensearch/common/var/log/opensearch/**

...because the default is now the classically confined grafana-agent snap.

@lucabello
Contributor

As we figured out in a call with @gabrielcocenza, there is no bug. The reason you can't see the log files under the filename label is that filename is now part of structured metadata, to avoid high-cardinality labels in Loki.
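
In practice this means a query that selected directly on a filename stream label, something like the first line below, no longer matches anything, but you can still filter on the file name as structured metadata after the stream selector. A sketch, reusing the job label from the diff earlier in this thread:

{filename="/snap/grafana-agent/51/shared-logs/opensearch/opensearch-ys4x.log"}

{job="opensearch-var-log-opensearch"} | filename=~".*opensearch.*"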

If you want to read more about why, here's a Discourse post from Jose!
