Custom Nagios alert in Ambari

The exercise here is to write a very simple Nagios plugin and integrate it into the Ambari web UI.

We’ll check whether the cluster is in safe mode and surface the result as an alert in Ambari.

First let’s write the plugin. It goes in the directory that already holds all the scripts used by Ambari, which you can duplicate and adapt.

[vagrant@gw ~]$ sudo vi /usr/lib64/nagios/plugins/check_safemode.sh
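#!/bin/bash
# Ask HDFS for the safe mode state; the reply ends with ON or OFF
ret=$(hadoop dfsadmin -safemode get)
if [[ $ret == *OFF ]]; then
    # Safe mode is off: print a message and report OK (exit 0)
    echo "OK: $ret"
    exit 0
fi
# Anything else is reported as a problem (exit 1 is WARNING for Nagios)
echo "KO : $ret"
exit 1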

Note that the plugin must echo something before every exit, otherwise Nagios will raise an alert of its own.
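
A slightly more defensive variant of the check could look like the sketch below. It relies on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and adds an UNKNOWN branch for output it does not recognise. The function name classify_safemode is made up for this illustration, and the idea that unexpected output should map to UNKNOWN is an assumption, not something the plugin above does.

```shell
#!/bin/bash
# Sketch only: classify the text printed by `hadoop dfsadmin -safemode get`.
# classify_safemode is a hypothetical helper name, not part of Ambari or Nagios.
classify_safemode() {
    local text=$1
    case "$text" in
        *"Safe mode is OFF"*)
            echo "OK: $text"
            return 0 ;;   # Nagios OK
        *"Safe mode is ON"*)
            echo "WARNING: $text"
            return 1 ;;   # Nagios WARNING, same exit code as the plugin above
        *)
            # Always print something, even when the output is unexpected.
            echo "UNKNOWN: unexpected output: ${text:-<empty>}"
            return 3 ;;   # Nagios UNKNOWN
    esac
}
```

Inside the plugin, this would be wired in as classify_safemode "$(hadoop dfsadmin -safemode get)" followed by exit $?.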

Now define the command that executes the plugin:

[vagrant@gw ~]$ sudo vi /etc/nagios/objects/hadoop-commands.cfg

...
define command{
        command_name    check_safemode
        command_line    $USER1$/check_wrapper.sh $USER1$/check_safemode.sh -H $HOSTADDRESS$
}
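
To see what Nagios will actually run, it can help to expand the macros by hand. In a standard Nagios setup, $USER1$ is defined in the resource file and normally points at the plugin directory, and $HOSTADDRESS$ is the address of the host being checked. The sketch below substitutes both; expand_command_line is a hypothetical helper, and the values /usr/lib64/nagios/plugins and gw are assumptions based on the path and prompt used in this post.

```shell
#!/bin/bash
# Sketch only: expand the two Nagios macros used in the command definition.
# expand_command_line is a hypothetical helper, not a Nagios tool.
expand_command_line() {
    local line=$1 user1=$2 hostaddress=$3
    # Keep the macro names in variables so the $ signs are matched literally.
    local m_user1='$USER1$' m_host='$HOSTADDRESS$'
    line=${line//"$m_user1"/$user1}
    line=${line//"$m_host"/$hostaddress}
    printf '%s\n' "$line"
}

# Assumed values: plugin directory from this post, host name from the prompt.
expand_command_line \
    '$USER1$/check_wrapper.sh $USER1$/check_safemode.sh -H $HOSTADDRESS$' \
    /usr/lib64/nagios/plugins gw
```

Running the resulting command by hand and checking echo $? is a quick way to test the plugin before touching Nagios. Note that check_safemode.sh as written ignores the -H argument.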

Look up the name of the hostgroup that will run the plugin in /etc/nagios/objects/hadoop-hostgroups.cfg, for example nagios-server. A single server is enough, since this is a cluster-wide HDFS check.

In /etc/nagios/objects/hadoop-servicegroups.cfg, find the service group the alert should belong to.
Here, we’ll put the alert in the HDFS service group.

Now add the alert entry:

[vagrant@gw ~]$ sudo vi /etc/nagios/objects/hadoop-services.cfg
...
# NAGIOS SERVER HDFS Checks
...
define service {
        hostgroup_name          nagios-server
        use                     hadoop-service
        service_description     HDFS::Is Cluster in Safe Mode
        servicegroups           HDFS
        check_command           check_safemode
        normal_check_interval   2
        retry_check_interval    1
        max_check_attempts      1
}

Note that normal_check_interval is the number of minutes between checks (strictly, it is counted in interval_length units, which default to 60 seconds).

Then restart Nagios:

[vagrant@gw ~]$ sudo service nagios restart
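
Since several object files were edited by hand, it may be worth validating the configuration before restarting. The sketch below uses nagios -v, the standard configuration check, and only restarts when it passes. The function name restart_nagios_if_valid is hypothetical, and the main config path /etc/nagios/nagios.cfg is an assumption based on the /etc/nagios/objects paths used above.

```shell
#!/bin/bash
# Sketch only: validate the Nagios configuration, then restart if it is clean.
# restart_nagios_if_valid is a hypothetical helper name.
restart_nagios_if_valid() {
    # Assumed location of the main config file; adjust for your install.
    local cfg=${1:-/etc/nagios/nagios.cfg}
    local output
    if output=$(sudo nagios -v "$cfg" 2>&1); then
        sudo service nagios restart
    else
        # Show what the check complained about and leave Nagios untouched.
        printf '%s\n' "$output" >&2
        echo "Configuration check failed, Nagios not restarted" >&2
        return 1
    fi
}
```

Calling restart_nagios_if_valid with no argument checks the assumed default path; pass another path as the first argument if your main config file lives elsewhere.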

The alert will appear in Ambari:
[Screenshot: nagios safe mode off]

To test, let’s put the cluster in safe mode:

[vagrant@gw ~]$ sudo su hdfs
[hdfs@gw vagrant]$ kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs
[hdfs@gw vagrant]$ hadoop dfsadmin -safemode enter
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Safe mode is ON

Within a minute or two (the check runs every 2 minutes), you’ll see that the alert is raised:

[Screenshot: nagios safe mode on]

Then leave safe mode to bring the alert back to OK:

[hdfs@gw vagrant]$ hadoop dfsadmin -safemode leave
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Safe mode is OFF

Note that this is for demonstration purposes only: for example, the plugin does not handle Kerberos, unlike the check_nodemanager_health plugin.

Also note that Nagios writes its output to the /var/nagios/status.dat file, which Ambari collects and reads to display this information.
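
To look at what Ambari reads, the state of the new alert can be pulled out of status.dat directly. The sketch below assumes the usual Nagios status file layout, where each service lives in a servicestatus { ... } block of key=value lines; service_state is a hypothetical helper name. It prints the current_state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the plugin output of the first matching service.

```shell
#!/bin/bash
# Sketch only: print "<current_state>|<plugin_output>" for one service.
# Assumes status.dat contains "servicestatus {" blocks of key=value lines.
# service_state is a hypothetical helper, not an Ambari or Nagios command.
service_state() {
    local status_file=$1 description=$2
    awk -v want="$description" '
        /^servicestatus[ \t]*[{]/ { in_block = 1; desc = ""; state = ""; out = ""; next }
        in_block && /^[ \t]*[}]/ {
            # End of a block: report it if it is the service we want.
            if (desc == want) { print state "|" out; found = 1; exit }
            in_block = 0
            next
        }
        in_block {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            if (eq == 0) next
            key = substr(line, 1, eq - 1)
            val = substr(line, eq + 1)
            if (key == "service_description") desc = val
            else if (key == "current_state") state = val
            else if (key == "plugin_output") out = val
        }
        END { exit(found ? 0 : 1) }
    ' "$status_file"
}
```

For the alert defined in this post, the call would be service_state /var/nagios/status.dat "HDFS::Is Cluster in Safe Mode"; the function returns a non-zero status if no such service is found.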

Adapted from Hortonworks documentation