Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solr metrics #291

Open
wants to merge 3 commits into
base: monitoring
Choose a base branch
from
Open

Solr metrics #291

wants to merge 3 commits into from

Conversation

mwiencek
Copy link
Member

This is based on #290 and I'm launching it in the same way (admin/configure add monitoring + docker compose up -d).

I followed https://solr.apache.org/guide/solr/latest/deployment-guide/monitoring-with-prometheus-and-grafana.html and copied the default dashboard under contrib/, only fixing the Prometheus data source name.

@mwiencek
Copy link
Member Author

There's a temporary commit here that requires you to build a local metabrainz/mb-solr:solr-9.7.0 image from the solr-9.7.0 branch of the mb-solr repository first:

docker build -t metabrainz/mb-solr:solr-9.7.0 -f Dockerfile .

@reosarevok
Copy link
Member

This works nicely locally - to be honest I have zero clues what is useful data for solr for us, but in https://docs.google.com/document/d/1vQBiHdxO_qkmxUAfS10QWhoULSYcPazxZWSxsxJMS9o/edit?pli=1&tab=t.0 we wanted "Number of collection documents & Data size" - is that in there? There's so much stuff, we should document exactly where to look if so!

@mwiencek
Copy link
Member Author

I also added a basic healthcheck to the search service here, though the indexer services still fails with an HTTP 500 on startup:

indexer-1          | 2025-01-22 05:51:49,394: Retrying...
indexer-1          | 2025-01-22 05:51:49,396: HTTP Error 500: Server Error
indexer-1          | Traceback (most recent call last):
indexer-1          |   File "/usr/local/lib/python2.7/site-packages/retrying.py", line 200, in call
indexer-1          |     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
indexer-1          |   File "sir/amqp/handler.py", line 444, in _watch_impl
indexer-1          |     handler = Handler(entities)
indexer-1          |   File "sir/amqp/handler.py", line 148, in __init__
indexer-1          |     self.cores[core_name] = solr_connection(core_name)
indexer-1          |   File "sir/util.py", line 88, in solr_connection
indexer-1          |     urllib2.urlopen(ping_uri)
indexer-1          |   File "/usr/local/lib/python2.7/urllib2.py", line 154, in urlopen
indexer-1          |     return opener.open(url, data, timeout)
indexer-1          |   File "/usr/local/lib/python2.7/urllib2.py", line 435, in open
indexer-1          |     response = meth(req, response)
indexer-1          |   File "/usr/local/lib/python2.7/urllib2.py", line 548, in http_response
indexer-1          |     'http', request, response, code, msg, hdrs)
indexer-1          |   File "/usr/local/lib/python2.7/urllib2.py", line 473, in error
indexer-1          |     return self._call_chain(*args)
indexer-1          |   File "/usr/local/lib/python2.7/urllib2.py", line 407, in _call_chain
indexer-1          |     result = func(*args)
indexer-1          |   File "/usr/local/lib/python2.7/urllib2.py", line 556, in http_error_default
indexer-1          |     raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
indexer-1          | HTTPError: HTTP Error 500: Server Error
indexer-1          | 2025-01-22 05:51:49,399: Connecting to Solr failed: HTTP Error 500: Server Error

@mwiencek
Copy link
Member Author

we wanted "Number of collection documents & Data size" - is that in there?

"Searcher Documents" and "Index Size" are both available under "Core Metrics." They seem to be tracked per shard IIUC.

@mwiencek
Copy link
Member Author

@yvanzo A couple small points:

  1. I realized the Solr health check I added was also useful for making sure the Solr exporter container (for Prometheus) didn't crash on startup. So I kept it, but converted it to a Perl script which checks that every collection is active. Do you see any problem with this method?

  2. Regarding sir, I tried switching from live-indexing-search to sir-dev locally, but this causes:

    indexer-1          | /usr/local/bin/docker-entrypoint.sh: line 15: MUSICBRAINZ_POSTGRES_SERVER: unbound variable
    

    I'm a bit confused because the README says this should default to db. However, even after adding MUSICBRAINZ_POSTGRES_SERVER=db to my .env file and running docker compose up --build indexer -d, it still fails with the same error.

@mwiencek mwiencek mentioned this pull request Jan 29, 2025
Copy link
Contributor

@yvanzo yvanzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the progress!

@yvanzo A couple small points:

1. I realized the Solr health check I added was also useful for making sure the Solr exporter container (for Prometheus) didn't crash on startup. So I kept it, but converted it to a Perl script which checks that every collection is active. Do you see any problem with this method?

It seems useful to have a health check. I made a few comments on the implementation.

2. Regarding sir, I tried switching from live-indexing-search to sir-dev locally, but this causes:
   ```
   indexer-1          | /usr/local/bin/docker-entrypoint.sh: line 15: MUSICBRAINZ_POSTGRES_SERVER: unbound variable
   ```

That’s a regression I just fixed in a separate PR about SIR development setup. You can cherry-pick the commit 5c5e6ec.

@@ -0,0 +1,60 @@
#!/usr/bin/perl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid introducing Perl in repositories that don’t use it already.

Python is also available in Solr image.

Copy link
Member Author

@mwiencek mwiencek Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only chose Perl because Python wasn't available in the image for me:

solr@22a1a3f26448:/opt/solr-9.7.0$ python
bash: python: command not found
solr@22a1a3f26448:/opt/solr-9.7.0$ perl --version

This is perl 5, version 34, subversion 0 (v5.34.0) built for x86_64-linux-gnu-thread-multi
[...]

Am I missing something? I'm fine with installing Python in the image instead. I could even use bash + jq. But I figured a small amount of Perl would be fine for a tiny one-off script.

(Edit: Perhaps Python was available in the current Solr image, but not the image built from the solr-9.7.0 branch.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, my bad, I mistook with another image and didn’t actually checked Solr image.

Yes, Bash would be fine.

To detect whether Solr is running locally or listening to the outside, I didn’t find anything useful in the APIs. This seems to work:
ps auxww | grep Djetty.host=localhost

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using lsof as discussed on IRC, and although it easily worked for detecting whether Solr was available for outside connections, the solr-exporter container still wasn't happy until the collections were healthy:

solr-exporter  | INFO  - 2025-01-31 17:33:40.329; org.apache.solr.client.solrj.impl.CloudSolrClient; Request to collection [annotation] failed due to (510) org.apache.solr.common.SolrException: Could not find a healthy node to handle the request., retry=0 maxRetries=5 commError=false errorCode=510 

Merging the lsof check with the collection health check script seems to work well, though.

Comment on lines 9 to 25
my @collections = qw(
area
artist
cdstub
editor
event
instrument
label
place
recording
release
release-group
series
tag
url
work
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it needed to check each collection separately?

Maybe a collection-agnostic query can be made instead? OVERSEERSTATUS for example?

Otherwise, I would suggest to fetch the list of collections either from API, or from the directory /usr/lib/mbsssss.

Copy link
Member Author

@mwiencek mwiencek Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any fields related to the health or status of the collections in the OVERSEERSTATUS response locally, so I'm not sure that one would work.

Getting the list from /usr/lib/mbsssss sounds like a good idea. I avoided using the API in case a collection failed to be created and wasn't returned from the API. In Matrix you mentioned that:

Solr isn’t accepting connections from outside the container when creating collections.

So I figured we should be sure all of the known collections are active first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the other discussion:

To detect whether Solr is running locally or listening to the outside, I didn’t find anything useful in the APIs. This seems to work: ps auxww | grep Djetty.host=localhost

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(From the other discussion again), I ended up merging the lsof check with the collection health check script to avoid "Could not find a healthy node" type errors in the solr-exporter logs.

I did update the script to get the collection names from /usr/lib/mbsssss.

@yvanzo
Copy link
Contributor

yvanzo commented Jan 31, 2025

@mwiencek: I merged SIR dev stuff into master and rebased the target branch monitoring on it. That’s why there are conflicts that need to be resolved when you will update it with your additional changes.

@mwiencek mwiencek marked this pull request as ready for review February 6, 2025 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants