
[bug:1564372] Setup Nagios server #41

Open
gluster-ant opened this issue Mar 12, 2020 · 12 comments
Labels: Migrated (bugs migrated from Bugzilla to GitHub), Type:Bug


URL: https://bugzilla.redhat.com/1564372
Creator: nigelb at redhat
Time: 20180406T06:09:12

We need to set up a Nagios server that alerts us to system failures. These include machines that are disconnected from Jenkins and/or have full disks. It would let us be proactive rather than merely reactive.

This is a long-running goal, but for the moment I'll settle for a Nagios server and Nagios clients on all machines.

If we want to replace Nagios with an equivalent like Icinga, that works too.
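
For reference, a minimal sketch of the kind of Nagios object definitions this implies. The host name, address, and thresholds are purely illustrative, and check_nrpe assumes a check_nrpe command is defined on the server and a check_disk command configured in the client's NRPE agent:

define host {
    use        linux-server
    host_name  builder00.example.org        ; illustrative name
    address    10.0.0.10                    ; illustrative address
}

define service {
    use                  generic-service
    host_name            builder00.example.org
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}

define service {
    use                  generic-service
    host_name            builder00.example.org
    service_description  Disk space
    check_command        check_nrpe!check_disk
}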

gluster-ant added the Migrated and Type:Bug labels on Mar 12, 2020

Time: 20180409T12:21:53
mscherer at redhat commented:
So, we need to have Nagios in the internal VLAN so it can monitor everything.

On the easy side, we can monitor ping, various metrics (disk space), and whether services are running.

What policy do we want for alerts? And what SLA/SLE, especially given the timezone differences?


Time: 20180409T16:20:52
nigelb at redhat commented:
I'd say alert a list like [email protected]. We'll still do best-effort working-day coverage; this only enhances our ability to spot failures sooner than waiting for someone else to notice them.
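
A contact pointed at the list is roughly all Nagios needs for that; a sketch, with the list address left as a placeholder since the real one is redacted above:

define contact {
    contact_name                   gluster-infra-list
    alias                          Gluster infra mailing list
    email                          infra-alerts@example.org    ; placeholder, not the real list address
    host_notification_period       24x7
    service_notification_period    24x7
    host_notification_options      d,u,r
    service_notification_options   w,c,r
    host_notification_commands     notify-host-by-email
    service_notification_commands  notify-service-by-email
}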


Time: 20180620T18:25:33
srangana at redhat commented:
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions that are currently maintained.

As a result this bug is being closed.

If the bug persists on a maintained version of Gluster or against the mainline Gluster repository, please request that it be reopened and set the Version field appropriately.


Time: 20180910T15:28:36
mscherer at redhat commented:
So, I reused the existing role I had and set up a Nagios server.

Now, I need to:

  • move Munin internally (the server is installed; I need to clean up the role and move the data)
  • connect Munin and Nagios
  • add more checks to Nagios (the hard part: doing that without repeating data all over the place; a sketch follows below)
  • add more servers

So far it has worked, because I got paged for an IPv6 problem in the cage (even though there is no IPv6 in the cage in the first place...).
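
On the "without repeating data" point, one common pattern (a sketch, not necessarily what the role does) is to attach services to a hostgroup whose membership the Ansible role generates, instead of repeating service definitions per host:

define hostgroup {
    hostgroup_name  ansible-managed
    alias           Servers managed by Ansible
    ; members filled in by the Ansible role from the inventory
}

define service {
    use                  generic-service
    hostgroup_name       ansible-managed
    service_description  SSH
    check_command        check_ssh
}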


Time: 20180926T11:24:14
mscherer at redhat commented:
So:

All servers managed by Ansible are now monitored for ping/SSH (which made it possible to see that our FreeBSD hosts blocked ping, because I got paged for that as soon as I deployed). That is, all except Gerrit prod.

I have added an SMTP port check on supercolony, and vhost checking for a couple of web sites; see the ansible repo for details.

For now, while I clean up the roles and such, I am the only one receiving alerts, but we will need a plan for the future; I discussed it with Nigel on IRC.

Notes for myself (and people who care), here is the list of things to do:

  • investigate NRPE further (e.g., the security impact of having it open on the NATed IP of the cage)

  • add the Munin/Nagios connection

  • add checks for processes:

    • cron
    • custom processes
  • add custom checks (Gerrit, the Jenkins server being offline, etc.)

  • refine the httpd check (more than just "HTTP 200"; a sketch follows below)
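
For that last point, a sketch of what "more than HTTP 200" could look like with check_http; the vhost, URI, and expected content string are assumptions:

define command {
    command_name  check_vhost_content
    command_line  $USER1$/check_http -I $HOSTADDRESS$ -H $ARG1$ --ssl -u $ARG2$ -s "$ARG3$"
}

define service {
    use                  generic-service
    host_name            supercolony.gluster.org
    service_description  www.gluster.org content
    check_command        check_vhost_content!www.gluster.org!/!Gluster
}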


Time: 20180926T15:57:17
mscherer at redhat commented:
So, the Munin -> Nagios connection does work, but:

  • I hit an SELinux issue:

type=AVC msg=audit(1537977117.718:115791): avc: denied { search } for pid=19206 comm="send_nsca" name="nagios" dev="dm-0" ino=271810 scontext=system_u:system_r:munin_t:s0-s0:c0.c1023 tcontext=system_u:object_r:nagios_etc_t:s0 tclass=dir

This one shouldn't be too hard to fix.

  • I have to understand how Munin is supposed to be integrated. For example, I see this (a possible fix is sketched below):

[1537976773] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;supercolony.gluster.org;Disk usage in percent;1;WARNINGs: / is 93.80 (outside range [:92]).
[1537976773] Warning: Passive check result was received for service 'Disk usage in percent' on host 'supercolony.gluster.org', but the service could not be found!

  • see why supercolony does alert, but not the builder I set up at 100% CPU
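
On the second point, the "service could not be found" warning is just Nagios rejecting a passive result for a service it does not know about; the usual fix is to declare a passive service whose service_description matches exactly what Munin sends, roughly like this (assuming a check_dummy command exists for the mandatory check_command):

define service {
    use                     generic-service
    host_name               supercolony.gluster.org
    service_description     Disk usage in percent
    active_checks_enabled   0
    passive_checks_enabled  1
    check_command           check_dummy!0
}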


Time: 20180926T16:36:39
mscherer at redhat commented:
Now I am blocked by:

type=AVC msg=audit(1537979718.243:116446): avc: denied { name_connect } for pid=27096 comm="send_nsca" dest=5667 scontext=system_u:system_r:munin_t:s0-s0:c0.c1023 tcontext=system_u:object_r:unreserved_port_t:s0 tclass=tcp_socket

Guess I might need to write my own policy.
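
Short of a proper policy, a one-off local module generated from the recorded denials would unblock things (a stopgap, not the upstream fix referenced in the next comments):

# build and load a local policy module from the send_nsca AVCs
grep send_nsca /var/log/audit/audit.log | audit2allow -M send_nsca_local
semodule -i send_nsca_local.pp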


Time: 20180927T13:50:37
mscherer at redhat commented:
First step:

fedora-selinux/selinux-policy#229


Time: 20180927T15:03:48
mscherer at redhat commented:
Second step:
fedora-selinux/selinux-policy-contrib#72

In the meantime, I will make Munin run unconfined on the server side until I can work on a send_nsca policy.
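
One way to do that stopgap, assuming "unconfined" here means not enforcing the munin_t domain, is to mark the domain permissive so the denials are only logged:

# stopgap only: keep labelling but stop enforcing the munin_t domain
semanage permissive -a munin_t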


Time: 20180928T15:21:43
mscherer at redhat commented:
So, I did deploy NRPE internally and am testing it on the Munin server. Right now it just checks the load and looks for zombie processes, but I have code for SELinux and for checking the RPM DB, and I think an architecture for adding more.
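
The load and zombie checks match the stock nrpe.cfg examples; roughly, on the client side (plugin path and thresholds may differ per distribution):

# /etc/nagios/nrpe.cfg on the client
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z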


Time: 20180928T17:54:40
mscherer at redhat commented:
So, status (again, mostly for myself):

  • the check for processes stuck in the Z state is done and working
  • the SELinux check is done and tested
  • the Munin notifications should now clean themselves up
  • the check for a specific process is done and working, tested on squid/ubunoun (see the sketch after this list)

Next steps:

  • verify NRPE again in detail (e.g., is it properly confined by SELinux, and what can a rogue client achieve)
  • improve notifications
  • add more checks on various servers
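
For the specific-process check mentioned in the list above, the usual NRPE-side pattern is check_procs matched on the command name; squid as in the comment, the thresholds are assumptions:

# /etc/nagios/nrpe.cfg on the client: alert if no squid process is running
command[check_proc_squid]=/usr/lib64/nagios/plugins/check_procs -w 1: -c 1: -C squid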


Time: 20190219T11:28:21
mscherer at redhat commented:
So, NRPE seems to be confined, the notifications got improved (the text messages are better than before), and I am adding servers one by one.
