
[bug:1564372] Setup Nagios server #41

Open
gluster-ant opened this issue Mar 12, 2020 · 12 comments
Labels: Migrated (bugs migrated from Bugzilla to GitHub), Type:Bug


URL: https://bugzilla.redhat.com/1564372
Creator: nigelb at redhat
Time: 20180406T06:09:12

We need to set up a Nagios server that alerts us to system failures. These include machines that are disconnected from Jenkins and/or have full disks. It would let us be proactive rather than merely reactive.

This is a long-running goal, but for the moment I'll settle for a Nagios server and Nagios clients on all machines.

If we want to replace Nagios with an equivalent like Icinga, that works too.
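
For reference, a minimal sketch of the kind of Nagios object definitions this implies. The host name, address, and thresholds are purely illustrative, and check_nrpe assumes a check_nrpe command is defined on the server and a check_disk command configured in the client's NRPE agent:

define host {
    use        linux-server
    host_name  builder00.example.org        ; illustrative name
    address    10.0.0.10                    ; illustrative address
}

define service {
    use                  generic-service
    host_name            builder00.example.org
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}

define service {
    use                  generic-service
    host_name            builder00.example.org
    service_description  Disk space
    check_command        check_nrpe!check_disk
}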

gluster-ant added the Migrated and Type:Bug labels on Mar 12, 2020

Time: 20180409T12:21:53
mscherer at redhat commented:
So, we need to have Nagios in the internal VLAN so it can monitor everything.

On the easy side, we can monitor ping, various metrics (disk space), and whether services are running.

What policy do we want for alerts? And what SLA/SLE, especially given the timezone differences?


Time: 20180409T16:20:52
nigelb at redhat commented:
I'd say alert a list like [email protected]. We'll still do best-effort working-day coverage; this only enhances our ability to spot failures sooner than waiting for someone else to notice them.
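
A contact pointed at the list is roughly all Nagios needs for that; a sketch, with the list address left as a placeholder since the real one is redacted above:

define contact {
    contact_name                   gluster-infra-list
    alias                          Gluster infra mailing list
    email                          infra-alerts@example.org    ; placeholder, not the real list address
    host_notification_period       24x7
    service_notification_period    24x7
    host_notification_options      d,u,r
    service_notification_options   w,c,r
    host_notification_commands     notify-host-by-email
    service_notification_commands  notify-service-by-email
}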


Time: 20180620T18:25:33
srangana at redhat commented:
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions that are currently maintained.

As a result this bug is being closed.

If the bug persists on a maintained version of Gluster or against the mainline Gluster repository, please request that it be reopened and set the Version field appropriately.


Time: 20180910T15:28:36
mscherer at redhat commented:
So, I reused the existing role I had and set up a Nagios server.

Now, I need to:

  • move Munin internally (the server is installed; I need to clean up the role and move the data)
  • connect Munin and Nagios
  • add more checks to Nagios (the hard part: doing that without repeating data all over the place; a sketch follows below)
  • add more servers

So far it has worked, because I got paged for an IPv6 problem in the cage (even though there is no IPv6 in the cage in the first place...).
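
On the "without repeating data" point, one common pattern (a sketch, not necessarily what the role does) is to attach services to a hostgroup whose membership the Ansible role generates, instead of repeating service definitions per host:

define hostgroup {
    hostgroup_name  ansible-managed
    alias           Servers managed by Ansible
    ; members filled in by the Ansible role from the inventory
}

define service {
    use                  generic-service
    hostgroup_name       ansible-managed
    service_description  SSH
    check_command        check_ssh
}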


Time: 20180926T11:24:14
mscherer at redhat commented:
So:

All servers managed by Ansible are now monitored for ping/SSH (which made it possible to see that our FreeBSD hosts blocked ping, because I got paged for that as soon as I deployed). That is, all except Gerrit prod.

I have added an SMTP port check on supercolony, and vhost checking for a couple of web sites; see the ansible repo for details.

For now, while I clean up the roles and such, I am the only one receiving alerts, but we will need a plan for the future; I discussed it with Nigel on IRC.

Notes for myself (and people who care), here is the list of things to do:

  • investigate NRPE further (e.g., the security impact of having it open on the NATed IP of the cage)

  • add the Munin/Nagios connection

  • add checks for processes:

    • cron
    • custom processes
  • add custom checks (Gerrit, the Jenkins server being offline, etc.)

  • refine the httpd check (more than just "HTTP 200"; a sketch follows below)
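
For that last point, a sketch of what "more than HTTP 200" could look like with check_http; the vhost, URI, and expected content string are assumptions:

define command {
    command_name  check_vhost_content
    command_line  $USER1$/check_http -I $HOSTADDRESS$ -H $ARG1$ --ssl -u $ARG2$ -s "$ARG3$"
}

define service {
    use                  generic-service
    host_name            supercolony.gluster.org
    service_description  www.gluster.org content
    check_command        check_vhost_content!www.gluster.org!/!Gluster
}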


Time: 20180926T15:57:17
mscherer at redhat commented:
So, the Munin -> Nagios connection does work, but:

  • I hit an SELinux issue:

type=AVC msg=audit(1537977117.718:115791): avc: denied { search } for pid=19206 comm="send_nsca" name="nagios" dev="dm-0" ino=271810 scontext=system_u:system_r:munin_t:s0-s0:c0.c1023 tcontext=system_u:object_r:nagios_etc_t:s0 tclass=dir

This one shouldn't be too hard to fix.

  • I have to understand how Munin is supposed to be integrated. For example, I see this (a possible fix is sketched below):

[1537976773] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;supercolony.gluster.org;Disk usage in percent;1;WARNINGs: / is 93.80 (outside range [:92]).
[1537976773] Warning: Passive check result was received for service 'Disk usage in percent' on host 'supercolony.gluster.org', but the service could not be found!

  • see why supercolony does alert, but not the builder I set up at 100% CPU
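
On the second point, the "service could not be found" warning is just Nagios rejecting a passive result for a service it does not know about; the usual fix is to declare a passive service whose service_description matches exactly what Munin sends, roughly like this (assuming a check_dummy command exists for the mandatory check_command):

define service {
    use                     generic-service
    host_name               supercolony.gluster.org
    service_description     Disk usage in percent
    active_checks_enabled   0
    passive_checks_enabled  1
    check_command           check_dummy!0
}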


Time: 20180926T16:36:39
mscherer at redhat commented:
Now I am blocked by:

type=AVC msg=audit(1537979718.243:116446): avc: denied { name_connect } for pid=27096 comm="send_nsca" dest=5667 scontext=system_u:system_r:munin_t:s0-s0:c0.c1023 tcontext=system_u:object_r:unreserved_port_t:s0 tclass=tcp_socket

Guess I might need to write my own policy.
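
Short of a proper policy, a one-off local module generated from the recorded denials would unblock things (a stopgap, not the upstream fix referenced in the next comments):

# build and load a local policy module from the send_nsca AVCs
grep send_nsca /var/log/audit/audit.log | audit2allow -M send_nsca_local
semodule -i send_nsca_local.pp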


Time: 20180927T13:50:37
mscherer at redhat commented:
First step:

fedora-selinux/selinux-policy#229


Time: 20180927T15:03:48
mscherer at redhat commented:
Second step:
fedora-selinux/selinux-policy-contrib#72

In the meantime, I will make Munin run unconfined on the server side until I can work on a send_nsca policy.
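
One way to do that stopgap, assuming "unconfined" here means not enforcing the munin_t domain, is to mark the domain permissive so the denials are only logged:

# stopgap only: keep labelling but stop enforcing the munin_t domain
semanage permissive -a munin_t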


Time: 20180928T15:21:43
mscherer at redhat commented:
So, I did deploy NRPE internally and am testing it on the Munin server. Right now it just checks the load and looks for zombie processes, but I have code for SELinux and for checking the RPM DB, and I think an architecture for adding more.
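
The load and zombie checks match the stock nrpe.cfg examples; roughly, on the client side (plugin path and thresholds may differ per distribution):

# /etc/nagios/nrpe.cfg on the client
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z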


Time: 20180928T17:54:40
mscherer at redhat commented:
So, status (again, mostly for myself):

  • the check for processes stuck in the Z state is done and working
  • the SELinux check is done and tested
  • the Munin notifications should now clean themselves up
  • the check for a specific process is done and working, tested on squid/ubunoun (see the sketch after this list)

Next steps:

  • verify NRPE again in detail (e.g., is it properly confined by SELinux, and what can a rogue client achieve)
  • improve notifications
  • add more checks on various servers
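
For the specific-process check mentioned in the list above, the usual NRPE-side pattern is check_procs matched on the command name; squid as in the comment, the thresholds are assumptions:

# /etc/nagios/nrpe.cfg on the client: alert if no squid process is running
command[check_proc_squid]=/usr/lib64/nagios/plugins/check_procs -w 1: -c 1: -C squid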


Time: 20190219T11:28:21
mscherer at redhat commented:
So, NRPE seems to be confined, the notifications got improved (the text messages are better than before), and I am adding servers one by one.
