Installing a Nagios probe for ESXi
Context
To support the hosting service my company offers, I ended up managing an ESXi server. There is no reason not to monitor it; I would even say monitoring matters more here. It is easy to fall into the classic virtualization trap of loading the server with many VMs while imagining that its performance grows along with the load... :D
Prerequisites
Installing the required packages
# yum install openssl-devel binutils perl perl-Nagios-Plugin perl-Class-MethodMaker mod_perl libuuid uuid-perl perl-XML-LibXML perl-XML-LibXML-Common
Installing the vSphere SDK for Perl
Download the tarball: VMware-vSphere-Perl-SDK-5.1.0-780721.x86_64.tar.gz
$ tar xvfz VMware-vSphere-Perl-SDK-5.1.0-780721.x86_64.tar.gz
$ cd vmware-vsphere-cli-distrib
For the SDK to install without trouble, two variables in the installer script have to be replaced:
my $httpproxy =0;
my $ftpproxy =0;
with:
my $httpproxy =1;
my $ftpproxy =1;
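The same substitution can be scripted rather than done by hand. The sketch below assumes the two variables are written as shown above (possibly with different spacing around the equals sign) and that they live in vmware-install.pl; the function name enable_proxy_flags is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: rewrites "my $httpproxy =0;" and "my $ftpproxy =0;"
# so that both are set to 1 in the file given as $1.
# Which file holds these lines is an assumption; pass the one that does.
enable_proxy_flags() {
    file=$1
    tmp=$(mktemp) || return 1
    # Match either variable, tolerate spaces around "=", write 1 instead of 0.
    # The result is copied back with cat so the file keeps its permissions.
    sed -E 's/^([[:space:]]*my \$(http|ftp)proxy[[:space:]]*=)[[:space:]]*0;/\1 1;/' \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage (from the vmware-vsphere-cli-distrib directory):
#   enable_proxy_flags vmware-install.pl
```

Only lines that begin with one of the two declarations are touched, so the rest of the script stays byte-for-byte identical.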
# ./vmware-install.pl
UUID problem
If the installer still reports an unresolved UUID dependency, do the following:
# yum install gcc
$ wget http://search.cpan.org/CPAN/authors/id/C/CF/CFABER/UUID-0.03.tar.gz
$ tar xvfz UUID-0.03.tar.gz
$ cd UUID-0.03
# perl Makefile.PL
# make
# make install
Then run ./vmware-install.pl again.
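Once the installer completes, it can be worth confirming that Perl actually finds the modules before moving on. This is only a sketch: the helper names are invented, and the module list in the usage comment is an assumption derived from the packages installed above (VMware::VIRuntime for the SDK, Nagios::Plugin, UUID, XML::LibXML); adjust it to whatever your plugin version imports.

```shell
#!/bin/sh
# Returns 0 if the Perl module named in $1 can be loaded, non-zero otherwise.
perl_module_available() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Prints one status line per module name passed as argument.
report_perl_modules() {
    for mod in "$@"; do
        if perl_module_available "$mod"; then
            echo "ok      $mod"
        else
            echo "MISSING $mod"
        fi
    done
}

# Usage (module list is an assumption, see above):
#   report_perl_modules VMware::VIRuntime Nagios::Plugin UUID XML::LibXML
```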
Installing the Nagios plugin
Download the plugin here: http://www.op5.org/community/plugin-inventory/op5-projects/check-esx-plugin
$ cd /usr/local/nagios/libexec/
$ wget -O check_vmware_api.pl 'http://git.op5.org/git/?p=nagios/op5plugins.git;a=blob_plain;f=check_vmware_api.pl;hb=HEAD'
# chown nagios:nagios check_vmware_api.pl
# chmod 755 check_vmware_api.pl
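Because the file comes from a web front end to a git repository, a failed download can leave an HTML page under the plugin's name. A minimal sanity check, assuming the plugin starts with a Perl shebang line (the function name is hypothetical):

```shell
#!/bin/sh
# Returns 0 if the first line of file $1 is a shebang that mentions perl,
# non-zero otherwise (including when the file cannot be read).
looks_like_perl_script() {
    first_line=$(head -n 1 "$1") || return 1
    case $first_line in
        '#!'*perl*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Usage:
#   looks_like_perl_script check_vmware_api.pl || echo "download looks wrong"
```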
Running the command a first time gives this:
$ ./check_vmware_api.pl --help
check_vmware_api.pl 0.7.0
This nagios plugin is free software, and comes with ABSOLUTELY NO WARRANTY.
It may be used, redistributed and/or modified under the terms of the GNU
General Public Licence (see http://www.fsf.org/licensing/licenses/gpl.txt).
VMWare Infrastructure plugin
Usage: check_vmware_api.pl -D <data_center> | -H <host_name> [ -C <cluster_name> ] [ -N <vm_name> ]
-u <user> -p <pass> | -f <authfile>
-l <command> [ -s <subcommand> ] [ -T <timeshift> ] [ -i <interval> ]
[ -x <black_list> ] [ -o <additional_options> ]
[ -t <timeout> ] [ -w <warn_range> ] [ -c <crit_range> ]
[ -V ] [ -h ]
-?, --usage
Print usage information
-h, --help
Print detailed help screen
-V, --version
Print version information
--extra-opts=[section][@file]
Read options from an ini file. See http://nagiosplugins.org/extra-opts
for usage and examples.
-H, --host=<hostname>
ESX or ESXi hostname.
-C, --cluster=<clustername>
ESX or ESXi clustername.
-D, --datacenter=<DCname>
Datacenter hostname.
-N, --name=<vmname>
Virtual machine name.
-u, --username=<username>
Username to connect with.
-p, --password=<password>
Password to use with the username.
-f, --authfile=<path>
Authentication file with login and password. File syntax :
username=<login>
password=<password>
-w, --warning=THRESHOLD
Warning threshold. See
http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
for the threshold format.
-c, --critical=THRESHOLD
Critical threshold. See
http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
for the threshold format.
-l, --command=COMMAND
Specify command type (CPU, MEM, NET, IO, VMFS, RUNTIME, ...)
-s, --subcommand=SUBCOMMAND
Specify subcommand
-S, --sessionfile=SESSIONFILE
Specify a filename to store sessions for faster authentication
-x, --exclude=<black_list>
Specify black list
-o, --options=<additional_options>
Specify additional command options (quickstats, ...)
-T, --timestamp=<timeshift>
Timeshift in seconds that could fix issues with "Unknown error". Use values like 5, 10, 20, etc
-i, --interval=<sampling period>
Sampling Period in seconds. Basic historic intervals: 300, 1800, 7200 or 86400. See config for any changes.
Supports literval values to autonegotiate interval value: r - realtime interval, h<number> - historical interval specified by position.
Default value is 20 (realtime). Since cluster does not have realtime stats interval other than 20(default realtime) is mandatory.
-M, --maxsamples=<max sample count>
Maximum number of samples to retrieve. Max sample number is ignored for historic intervals.
Default value is 1 (latest available sample).
--trace=<level>
Set verbosity level of vSphere API request/respond trace
-t, --timeout=INTEGER
Seconds before plugin times out (default: 30)
-v, --verbose
Show details for command-line debugging (can repeat up to 3 times)
Supported commands(^ - blank or not specified parameter, o - options, T - timeshift value, b - blacklist) :
VM specific :
* cpu - shows cpu info
+ usage - CPU usage in percentage
+ usagemhz - CPU usage in MHz
+ wait - CPU wait time in ms
+ ready - CPU ready time in ms
^ all cpu info(no thresholds)
* mem - shows mem info
+ usage - mem usage in percentage
+ usagemb - mem usage in MB
+ swap - swap mem usage in MB
+ swapin - swapin mem usage in MB
+ swapout - swapout mem usage in MB
+ overhead - additional mem used by VM Server in MB
+ overall - overall mem used by VM Server in MB
+ active - active mem usage in MB
+ memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
^ all mem info(except overall and no thresholds)
* net - shows net info
+ usage - overall network usage in KBps(Kilobytes per Second)
+ receive - receive in KBps(Kilobytes per Second)
+ send - send in KBps(Kilobytes per Second)
^ all net info(except usage and no thresholds)
* io - shows disk I/O info
+ usage - overall disk usage in MB/s
+ read - read latency in ms (totalReadLatency.average)
+ write - write latency in ms (totalWriteLatency.average)
^ all disk io info(no thresholds)
* runtime - shows runtime info
+ con - connection state
+ cpu - allocated CPU in MHz
+ mem - allocated mem in MB
+ state - virtual machine state (UP, DOWN, SUSPENDED)
+ status - overall object status (gray/green/red/yellow)
+ consoleconnections - console connections to VM
+ guest - guest OS status, needs VMware Tools
+ tools - VMWare Tools status
+ issues - all issues for the host
^ all runtime info(except con and no thresholds)
Host specific :
* cpu - shows cpu info
+ usage - CPU usage in percentage
o quickstats - switch for query either PerfCounter values or Runtime info
+ usagemhz - CPU usage in MHz
o quickstats - switch for query either PerfCounter values or Runtime info
^ all cpu info
o quickstats - switch for query either PerfCounter values or Runtime info
* mem - shows mem info
+ usage - mem usage in percentage
o quickstats - switch for query either PerfCounter values or Runtime info
+ usagemb - mem usage in MB
o quickstats - switch for query either PerfCounter values or Runtime info
+ swap - swap mem usage in MB
o listvm - turn on/off output list of swapping VM's
+ overhead - additional mem used by VM Server in MB
+ overall - overall mem used by VM Server in MB
+ memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
o listvm - turn on/off output list of ballooning VM's
^ all mem info(except overall and no thresholds)
* net - shows net info
+ usage - overall network usage in KBps(Kilobytes per Second)
+ receive - receive in KBps(Kilobytes per Second)
+ send - send in KBps(Kilobytes per Second)
+ nic - makes sure all active NICs are plugged in
^ all net info(except usage and no thresholds)
* io - shows disk io info
+ aborted - aborted commands count
+ resets - bus resets count
+ read - read latency in ms (totalReadLatency.average)
+ write - write latency in ms (totalWriteLatency.average)
+ kernel - kernel latency in ms
+ device - device latency in ms
+ queue - queue latency in ms
^ all disk io info
* vmfs - shows Datastore info
+ (name) - free space info for datastore with name (name)
o used - output used space instead of free
o breif - list only alerting volumes
o regexp - whether to treat name as regexp
o blacklistregexp - whether to treat blacklist as regexp
b - blacklist VMFS's
T (value) - timeshift to detemine if we need to refresh
^ all datastore info
o used - output used space instead of free
o breif - list only alerting volumes
o blacklistregexp - whether to treat blacklist as regexp
b - blacklist VMFS's
T (value) - timeshift to detemine if we need to refresh
* runtime - shows runtime info
+ con - connection state
+ health - checks cpu/storage/memory/sensor status
o listitems - list all available sensors(use for listing purpose only)
o blackregexpflag - whether to treat blacklist as regexp
b - blacklist status objects
+ storagehealth - storage status check
o blackregexpflag - whether to treat blacklist as regexp
b - blacklist status objects
+ temperature - temperature sensors
o blackregexpflag - whether to treat blacklist as regexp
b - blacklist status objects
+ sensor - threshold specified sensor
+ maintenance - shows whether host is in maintenance mode
+ list(vm) - list of VMWare machines and their statuses
+ status - overall object status (gray/green/red/yellow)
+ issues - all issues for the host
b - blacklist issues
^ all runtime info(health, storagehealth, temperature and sensor are represented as one value and no thresholds)
* service - shows Host service info
+ (names) - check the state of one or several services specified by (names), syntax for (names):<service1>,<service2>,...,<serviceN>
^ show all services
* storage - shows Host storage info
+ adapter - list bus adapters
b - blacklist adapters
+ lun - list SCSI logical units
b - blacklist LUN's
+ path - list logical unit paths
b - blacklist paths
^ show all storage info
* uptime - shows Host uptime
o quickstats - switch for query either PerfCounter values or Runtime info
* device - shows Host specific device info
+ cd/dvd - list vm's with attached cd/dvd drives
o listall - list all available devices(use for listing purpose only)
DC specific :
* cpu - shows cpu info
+ usage - CPU usage in percentage
o quickstats - switch for query either PerfCounter values or Runtime info
+ usagemhz - CPU usage in MHz
o quickstats - switch for query either PerfCounter values or Runtime info
^ all cpu info
o quickstats - switch for query either PerfCounter values or Runtime info
* mem - shows mem info
+ usage - mem usage in percentage
o quickstats - switch for query either PerfCounter values or Runtime info
+ usagemb - mem usage in MB
o quickstats - switch for query either PerfCounter values or Runtime info
+ swap - swap mem usage in MB
+ overhead - additional mem used by VM Server in MB
+ overall - overall mem used by VM Server in MB
+ memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
^ all mem info(except overall and no thresholds)
* net - shows net info
+ usage - overall network usage in KBps(Kilobytes per Second)
+ receive - receive in KBps(Kilobytes per Second)
+ send - send in KBps(Kilobytes per Second)
^ all net info(except usage and no thresholds)
* io - shows disk io info
+ aborted - aborted commands count
+ resets - bus resets count
+ read - read latency in ms (totalReadLatency.average)
+ write - write latency in ms (totalWriteLatency.average)
+ kernel - kernel latency in ms
+ device - device latency in ms
+ queue - queue latency in ms
^ all disk io info
* vmfs - shows Datastore info
+ (name) - free space info for datastore with name (name)
o used - output used space instead of free
o breif - list only alerting volumes
o regexp - whether to treat name as regexp
o blacklistregexp - whether to treat blacklist as regexp
b - blacklist VMFS's
T (value) - timeshift to detemine if we need to refresh
^ all datastore info
o used - output used space instead of free
o breif - list only alerting volumes
o blacklistregexp - whether to treat blacklist as regexp
b - blacklist VMFS's
T (value) - timeshift to detemine if we need to refresh
* runtime - shows runtime info
+ list(vm) - list of VMWare machines and their statuses
+ listhost - list of VMWare esx host servers and their statuses
+ listcluster - list of VMWare clusters and their statuses
+ tools - VMWare Tools status
b - blacklist VM's
+ status - overall object status (gray/green/red/yellow)
+ issues - all issues for the host
b - blacklist issues
^ all runtime info(except cluster and tools and no thresholds)
* recommendations - shows recommendations for cluster
+ (name) - recommendations for cluster with name (name)
^ all clusters recommendations
Cluster specific :
* cpu - shows cpu info
+ usage - CPU usage in percentage
+ usagemhz - CPU usage in MHz
^ all cpu info
* mem - shows mem info
+ usage - mem usage in percentage
+ usagemb - mem usage in MB
+ swap - swap mem usage in MB
o listvm - turn on/off output list of swapping VM's
+ memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
o listvm - turn on/off output list of ballooning VM's
^ all mem info(plus overhead and no thresholds)
* cluster - shows cluster services info
+ effectivecpu - total available cpu resources of all hosts within cluster
+ effectivemem - total amount of machine memory of all hosts in the cluster
+ failover - VMWare HA number of failures that can be tolerated
+ cpufainess - fairness of distributed cpu resource allocation
+ memfainess - fairness of distributed mem resource allocation
^ only effectivecpu and effectivemem values for cluster services
* runtime - shows runtime info
+ list(vm) - list of VMWare machines in cluster and their statuses
+ listhost - list of VMWare esx host servers in cluster and their statuses
+ status - overall cluster status (gray/green/red/yellow)
+ issues - all issues for the cluster
b - blacklist issues
^ all cluster runtime info
* vmfs - shows Datastore info
+ (name) - free space info for datastore with name (name)
o used - output used space instead of free
o breif - list only alerting volumes
o regexp - whether to treat name as regexp
o blacklistregexp - whether to treat blacklist as regexp
b - blacklist VMFS's
T (value) - timeshift to detemine if we need to refresh
^ all datastore info
o used - output used space instead of free
o breif - list only alerting volumes
o blacklistregexp - whether to treat blacklist as regexp
b - blacklist VMFS's
T (value) - timeshift to detemine if we need to refresh
Copyright (c) 2008 op5
A quick test then fails with an error like this:
CHECK_VMWARE_API.PL CRITICAL - Server version unavailable at ...
Certificate verification is the culprit. If you do not want to pass the certificate as a parameter, use this option:
--no-certificate-checking
or add this at the beginning of the Perl script:
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;
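When testing the plugin by hand like this, keep in mind that Nagios acts on the exit status, not on the printed text. By Nagios plugin convention 0 means OK, 1 WARNING, 2 CRITICAL and 3 UNKNOWN. The helper below is a small sketch for reading that status at the shell; its name, and the host and credentials in the usage comment, are placeholders.

```shell
#!/bin/sh
# Map a plugin exit status to the state name Nagios displays.
# 3 is the conventional UNKNOWN; this sketch reports any other value
# as UNKNOWN too.
nagios_state_name() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Usage (placeholder host and credentials):
#   ./check_vmware_api.pl -H esx.example.org -u monitor -p secret -l cpu -s usage -w 80 -c 90
#   rc=$?
#   echo "state: $(nagios_state_name "$rc")"
```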
Nagios configuration
We will store the ESXi login credentials in the etc/resource.cfg file, which must not be accessible through the CGIs:
$USER09$=username
$USER10$=password
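As an alternative to passing -u and -p on the command line, the help output above documents an -f option that reads the credentials from a file. This is not what the rest of this post uses; the sketch below only shows how such a file could be created with owner-only permissions. The function name, the path and the credentials are placeholders.

```shell
#!/bin/sh
# Write an auth file in the "username=... / password=..." syntax shown in
# the plugin help, readable and writable by its owner only.
write_authfile() {
    path=$1
    user=$2
    pass=$3
    # umask covers a newly created file, chmod covers a pre-existing one.
    ( umask 077 && printf 'username=%s\npassword=%s\n' "$user" "$pass" > "$path" ) &&
        chmod 600 "$path"
}

# Usage (placeholders throughout):
#   write_authfile /usr/local/nagios/etc/esx.auth monitor 'secret'
#   chown nagios:nagios /usr/local/nagios/etc/esx.auth
#   ./check_vmware_api.pl -H esx.example.org -f /usr/local/nagios/etc/esx.auth -l cpu -s usage
```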
Next, define the commands:
# 'check_esx_cpu' command definition
define command{
command_name check_esx_cpu
command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l cpu -s usage -w $ARG1$ -c $ARG2$
}
# 'check_esx_mem' command definition
define command{
command_name check_esx_mem
command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l mem -s usage -w $ARG1$ -c $ARG2$
}
# 'check_esx_net' command definition
define command{
command_name check_esx_net
command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l net -s usage -w $ARG1$ -c $ARG2$
}
# 'check_esx_runtime' command definition
define command{
command_name check_esx_runtime
command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l runtime -s status
}
# 'check_esx_ioread' command definition
define command{
command_name check_esx_ioread
command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l io -s read -w $ARG1$ -c $ARG2$
}
# 'check_esx_iowrite' command definition
define command{
command_name check_esx_iowrite
command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l io -s write -w $ARG1$ -c $ARG2$
}
Then the usual host definition:
define host{
use generic-host
host_name myesx1
alias myesx1
address XXX.XXX.XXX.XXX
}
And the service definitions:
define service{
use generic-service
host_name myesx1
service_description ESXi CPU Load
check_command check_esx_cpu!80!90
}
define service{
use generic-service
host_name myesx1
service_description ESXi Memory usage
check_command check_esx_mem!80!90
}
define service{
use generic-service
host_name myesx1
service_description ESXi Network usage
check_command check_esx_net!102400!204800
}
define service{
use generic-service
host_name myesx1
service_description ESXi Runtime status
check_command check_esx_runtime
}
define service{
use generic-service
host_name myesx1
service_description ESXi IO read
check_command check_esx_ioread!40!90
}
define service{
use generic-service
host_name myesx1
service_description ESXi IO write
check_command check_esx_iowrite!40!90
}
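In these definitions, the exclamation marks in check_command separate the command name from its arguments: Nagios substitutes the first value for $ARG1$ and the second for $ARG2$ in the matching command_line. The function below is not part of Nagios, just an illustrative sketch (with a made-up name) of that split.

```shell
#!/bin/sh
# Split a check_command value such as 'check_esx_cpu!80!90' into the
# command name and the numbered arguments that fill $ARG1$, $ARG2$, ...
split_check_command() {
    spec=$1
    echo "command_name=${spec%%!*}"
    case $spec in
        *!*) rest=${spec#*!} ;;
        *)   return 0 ;;          # no '!' means no arguments
    esac
    i=1
    while :; do
        echo "ARG$i=${rest%%!*}"
        case $rest in
            *!*) rest=${rest#*!} ;;
            *)   break ;;
        esac
        i=$((i + 1))
    done
}

# Usage:
#   split_check_command 'check_esx_cpu!80!90'
```

For check_esx_cpu!80!90 this yields the name check_esx_cpu with ARG1=80 and ARG2=90, which is how the -w and -c thresholds of the command definitions receive their values.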
Conclusion
That's it: you now have basic monitoring of your ESX server in place. For finer-grained monitoring, have a look at this documentation: http://www.op5.com/how-to/monitoring-vmware-esx-3-x-esxi-vsphere-4-and-vcenter-server