Nagios: Addition and Removal of Machines Without the Playbooks
If, for whatever reason, a server needs to be monitored but can't be set up via the Nagios_Ansible_Config_Tool, the following steps can be done manually.
The process to configure the client machines is fairly simple. However, the implementation will vary slightly by OS and architecture.
Key requirements:
- Local user named ‘nagios’
- SSH public key authentication (authorized_keys)
- Nagios plugins
This will vary depending on the OS, but the following rules need to be satisfied:
- The user must be called 'nagios'
- The user must have permissions to access /usr/local/nagios/libexec/
- The user must have a ~/.ssh/authorized_keys file defined, containing the Nagios Server's Nagios user's id_rsa.pub
The ssh key can be copied to the Nagios Server via: ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
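Where ssh-copy-id is not an option, the authorized_keys requirement can be met by hand. The following is a minimal sketch, not part of the project's tooling: the function name `install_nagios_key` and its arguments are hypothetical, and it assumes the nagios user already exists and that a copy of the server's id_rsa.pub is available on the client.

```shell
#!/bin/sh
# Hypothetical helper (name and arguments are illustrative only):
# append a public key to <home>/.ssh/authorized_keys, creating the
# directory and file with the permissions sshd normally expects.
install_nagios_key() {
    home_dir=$1      # home directory of the local 'nagios' user
    pubkey_file=$2   # copy of the Nagios server's nagios user's id_rsa.pub

    mkdir -p "$home_dir/.ssh" || return 1
    chmod 700 "$home_dir/.ssh" || return 1
    touch "$home_dir/.ssh/authorized_keys" || return 1
    chmod 600 "$home_dir/.ssh/authorized_keys" || return 1

    key=$(cat "$pubkey_file") || return 1

    # Append the key only if that exact line is not already present,
    # so running the function twice does not duplicate the entry.
    if ! grep -q -x -F -- "$key" "$home_dir/.ssh/authorized_keys"; then
        printf '%s\n' "$key" >> "$home_dir/.ssh/authorized_keys"
    fi
}
```

Run as the nagios user on the client, this might be called as `install_nagios_key "$HOME" /tmp/nagios_server_id_rsa.pub`, where the second path is wherever the server's public key was copied to.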
Most distributions are able to install nagios-plugins from their package manager.
For example, on Ubuntu 20.04 / Linux Mint 20:
sudo apt install nagios-plugins
If this isn't possible on your distribution, the common Nagios plugins can be built by following this guide.
Once this has been done, ensure that the plugins can be accessed at /usr/local/nagios/libexec. If this is not the case, symlink the folder to that location:
ln -s /usr/lib/nagios/plugins /usr/local/nagios/libexec
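The check-then-symlink step can be wrapped so that it is safe to re-run. This is a sketch under assumptions: the function name `link_plugins` is hypothetical, and the source directory differs between distributions, so it is passed in rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical helper: make the plugins reachable at the expected path.
link_plugins() {
    src=$1    # where the package manager installed the plugins
    dest=$2   # where the checks are expected to live

    # Nothing to do if the destination already exists (directory or symlink).
    if [ -e "$dest" ] || [ -L "$dest" ]; then
        return 0
    fi

    # Refuse to create a dangling symlink.
    if [ ! -d "$src" ]; then
        echo "plugin directory not found: $src" >&2
        return 1
    fi

    # Create the parent directory if needed, then link.
    mkdir -p "$(dirname "$dest")" && ln -s "$src" "$dest"
}
```

With the paths used above, the call would be `link_plugins /usr/lib/nagios/plugins /usr/local/nagios/libexec`, most likely run with root privileges.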
With the client set up, the Nagios server needs to be told to monitor the client and which services to monitor. The template for what we typically monitor a client for can be found here, with additional service definitions found here. An overview of the checks we use can be found in the overview page. Copy and paste the required templates / service checks into a file at /usr/local/nagios/etc/servers/*HOSTNAME*.cfg on the Nagios Server.
Different distributions will require different checks, depending on their package manager and the method they use to sync time. For distributions that use systemd-timesyncd, use the check_timesync.cfg template. For distributions using NTP, use the check_ntp_timesync.cfg template.
There are several items that will need to be changed in these template files, to make the server definition specific to the machine. Here's a table of the values that need to be changed:
Template String | Changed to | Example |
---|---|---|
ReplaceHostName | The machine's name as it appears in the inventory.yml | build-spearhead-freebsd12-x64-1 |
ReplaceAliasDescription | A description of the machine | Add by Ansible |
ReplaceIPAddress | IP Address of the Machine | 185.131.222.224 |
Note: The Add by Ansible message is set when the machine has automatically been added via Ansible. When manually setting up machines, a more descriptive message is useful, e.g. the message for ci.adoptopenjdk.net is AdoptOpenJDK - Jenkins server.
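The three substitutions in the table can be applied in one pass with sed. This is a minimal sketch: the function name `render_server_cfg` is hypothetical, and it makes no assumption about the layout of the template files beyond the three placeholder strings listed above.

```shell
#!/bin/sh
# Hypothetical helper: replace the three placeholder strings in a template
# and print the result on stdout.
render_server_cfg() {
    template=$1     # path to a copy of the template file
    host_name=$2    # replaces ReplaceHostName
    alias_desc=$3   # replaces ReplaceAliasDescription
    ip_addr=$4      # replaces ReplaceIPAddress

    # '|' is used as the sed delimiter, so the values must not contain
    # '|', '&' or '\' (assumed true for hostnames, descriptions and IPs).
    sed -e "s|ReplaceHostName|$host_name|g" \
        -e "s|ReplaceAliasDescription|$alias_desc|g" \
        -e "s|ReplaceIPAddress|$ip_addr|g" \
        "$template"
}
```

The output can be redirected straight into the server definition, for example `render_server_cfg template.cfg HOSTNAME "A description" 185.131.222.224 > /usr/local/nagios/etc/servers/HOSTNAME.cfg`, with `template.cfg` standing in for whichever template was copied.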
When adding a new machine to be monitored, it's important to add it to the correct hostgroup in the /usr/local/nagios/etc/objects/hostgroups.cfg file. If the machine is part of an already defined hostgroup (e.g. spearhead, ibmcloud, marist, etc.), then all that's required is to add the hostname to the group's members field. If the machine belongs to a new hostgroup / comes from a new provider, the hostgroup needs to be created with the following template:
define hostgroup{
hostgroup_name <Provider_Name>
alias <Provider_Name>
members ,<Newly_Added_Machine_Hostname>
}
After the server definition has been made and the hostgroups updated, the syntax of the Nagios config files can be checked by running /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg. If no errors are reported, the service can be restarted by running sudo service nagios restart, and the machine should be viewable at nagios.adoptopenjdk.net
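The verify-then-restart sequence can be combined so that a restart only happens when the check passes. This is a sketch: the paths and restart command are the ones given above, while the function name `verify_and_restart` and the overridable variables are illustrative additions.

```shell
#!/bin/sh
# Defaults are the paths and restart command from the text; each can be
# overridden through the environment. The function name is illustrative.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}
RESTART_CMD=${RESTART_CMD:-sudo service nagios restart}

verify_and_restart() {
    # 'nagios -v' exits non-zero when the configuration has errors.
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        # Intentionally unquoted so the command string is word-split.
        $RESTART_CMD
    else
        echo "Nagios config check failed; not restarting" >&2
        return 1
    fi
}
```

Calling `verify_and_restart` on the Nagios server leaves the running service untouched if the new definition contains a syntax error.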
Note: More information about additional services such as check_http, the package manager services, and passwd expiry can be found in the Nagios: Monitoring Additional Services page.
If a client has ICMP disabled and is therefore 'unpingable', the ping service will have to be removed from the server definition.
Additional steps are required to complete the client setup for systems that the Nagios server can't reach directly (NAT'ed, firewalled, etc.). The following steps configure a Reverse SSH Tunnel from the client system to the Nagios server, allowing the server to monitor the client across the tunnel. These steps assume the Client Configuration section has been completed.
All of these commands should be run as the Nagios user, on the respective machines:
- Copy the Nagios_RemoteTunnel.sh script to the client machine at ~/Nagios_RemoteTunnel.sh.
- Ensure the script is always running
crontab -e
* * * * * ~/Nagios_RemoteTunnel.sh
- Test the Reverse SSH Tunnel connection
# Change *PORTNUMBER* to the remote port number set in the script
/usr/local/nagios/libexec/check_by_ssh -H 127.0.0.1 -p *PORTNUMBER* -n lh -s c1:c2:c3 -C uptime -C uptime -C uptime
- Manually update the server definition file at /usr/local/nagios/etc/servers/*HOSTNAME*.cfg, so the address field instead has 127.0.0.1 -p *PORTNUMBER*
- Check the Nagios config and restart if no errors occur
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
# If all is good restart nagios
service nagios restart
If all went well, the machine should be viewable at nagios.adoptopenjdk.net
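For readers unfamiliar with reverse tunnels, the mechanism behind the steps above can be sketched as follows. This is NOT the contents of Nagios_RemoteTunnel.sh, which is not reproduced on this page: the port number, server name and function names are all placeholders chosen for illustration. The idea is that the client opens an outbound SSH connection and asks the server to forward a loopback port back to the client's own sshd, which is why the server definition then points at 127.0.0.1 with that port.

```shell
#!/bin/sh
# Illustrative only: this is not the project's Nagios_RemoteTunnel.sh.
# Both values below are placeholders.
REMOTE_PORT=${REMOTE_PORT:-2222}                    # hypothetical port on the Nagios server
NAGIOS_SERVER=${NAGIOS_SERVER:-nagios.example.com}  # hypothetical server name

# Build the ssh arguments for a reverse tunnel: connections made on the
# server to 127.0.0.1:REMOTE_PORT are forwarded to this machine's port 22.
#   -f  go to the background after authentication
#   -N  do not run a remote command
#   -R  remote (reverse) port forward
tunnel_args() {
    printf '%s\n' "-f -N -o ExitOnForwardFailure=yes -R ${REMOTE_PORT}:localhost:22 nagios@${NAGIOS_SERVER}"
}

# Start the tunnel unless an ssh process with the same arguments is already
# running, so that calling this from cron every minute is harmless.
ensure_tunnel() {
    args=$(tunnel_args)
    if ! pgrep -f "ssh $args" >/dev/null 2>&1; then
        # Word-splitting of $args is intended here.
        ssh $args
    fi
}
```

A script built this way would end with a call to `ensure_tunnel`. Running it from cron once a minute, as in the crontab entry above, re-establishes the tunnel shortly after it drops.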
If machines that are being monitored by Nagios are being decommissioned, they will have to be manually removed from Nagios.
Note: Machines shouldn't be removed from Nagios until they have been removed from inventory.yml, or a PR has been raised to remove them from inventory.yml.
On the Nagios Server, as the Nagios user:
# Remove the server definition
rm /usr/local/nagios/etc/servers/*HOSTNAME*.cfg
rm: remove write-protected regular file <*HOSTNAME*.cfg>? yes
When removing a machine, the /usr/local/nagios/etc/objects/hostgroups.cfg file also needs to be updated. Find the hostgroup the machine belongs to (e.g. spearhead, ibmcloud, marist) and remove the hostname from its members field.
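Finding the right hostgroup by eye can be tedious in a large file. The following sketch prints the name of every hostgroup that lists a given hostname. The function name `find_hostgroups` is hypothetical, and it assumes each definition follows the template layout shown earlier, with hostgroup_name appearing before members and the members given as a comma-separated list on one line.

```shell
#!/bin/sh
# Hypothetical helper: print the hostgroup_name of every hostgroup whose
# members list contains the given hostname as an exact entry.
find_hostgroups() {
    host=$1   # hostname to look for
    file=$2   # path to hostgroups.cfg

    awk -v host="$host" '
        # Remember the most recently seen hostgroup name.
        $1 == "hostgroup_name" { name = $2 }

        $1 == "members" {
            list = $0
            sub(/^[ \t]*members[ \t]+/, "", list)   # drop the field name
            sub(/[ \t]+$/, "", list)                # drop trailing blanks
            n = split(list, parts, /[ \t]*,[ \t]*/)
            # Compare whole entries so host-a does not match host-a1.
            for (i = 1; i <= n; i++)
                if (parts[i] == host) print name
        }
    ' "$file"
}
```

For example, `find_hostgroups HOSTNAME /usr/local/nagios/etc/objects/hostgroups.cfg` lists the groups to edit. The function only reads the file, so removing the entry remains a manual step.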
Once this is done, the Nagios configuration can be checked, and restarted if all is well:
# Alternatively `check_nagios` has been aliased to this command
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
# If all is good restart nagios
sudo /etc/init.d/nagios restart
If you don't have access to the Nagios server to do these steps, please raise an issue with Nagios: as a prefix in the title.