Distributed monitoring with Nagios and Puppet
In the past I had only one Nagios3 server to monitor all my production servers. The configuration was
fully generated by Puppet with Naginator.
This solution, even with its drawbacks (hard to set specific alert thresholds, appliances without Puppet, etc.),
is very powerful. I never had to worry about monitoring configuration:
thanks to Puppet, I am always sure that every host in production is monitored by Nagios.
However my needs have evolved and I began to run into distributed monitoring problems:
4 datacenters spread between Europe and the USA, and network outages between datacenters
raising a lot of false positive alerts.
I didn’t have any performance issues, as I have fewer than 200 hosts and 2K services.
I tried Shinken, really I tried: two years ago, and again these last few months.
I had to package it into a Debian package because all of our servers are
built unattended: the installation script was not an option for me.
On paper, Shinken was perfect:
* fully compatible with the Nagios configuration
* support for Shinken-specific parameters in Puppet (e.g. poller_tag)
* built-in support for distributed monitoring with realms and poller_tag
* built-in support for HA
* built-in support for Livestatus
* very nice community and developers
In my experience:
* the configuration was not fully compatible (but adjustments were easy)
* Shinken uses a lot more RAM than Nagios (even if Jean Gabès took the time to write me a very long mail explaining this behavior)
* most important to me: the whole stack was, IMHO, not stable and robust enough for my use case: after a netsplit, daemons did not resynchronize; some modules crashed quite often without explanation; there were some problems with Pyro; etc.
In the end I was not confident enough in my monitoring POC, and I decided not to put it in production.
To be clear:
* I still believe that Shinken will be, in the (near?) future, one solution (or THE solution) to replace the old Nagios, but it was not ready
for my needs
* Some people are running Shinken in production, some of them on very big setups, without any problem. My experience should not
convince you not to try this product! You need to form your own opinion!
Having decided not to use Shinken, I had to find another solution.
I chose this architecture:
* one Nagios per datacenter for polling
* Puppet to manage the whole distributed configuration (it plays the role of the Shinken arbiter here)
* Livestatus + Check_MK Multisite to aggregate monitoring views from all datacenters
Puppet tricks
We use a lot of custom Facts in Puppet, and we have a Fact “$::hosting”
which tells us in which datacenter each host lives.
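For illustration, such a fact can be as simple as a plain-text external fact (a minimal sketch assuming Facter’s facts.d support; a custom Ruby fact works just as well):

```
# /etc/facter/facts.d/hosting.txt on a host in the “hosting1” datacenter
hosting=hosting1
```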
In order to split our monitoring configuration between pollers, I use a dynamic target for all Puppet resources bound to a datacenter (hosts, services, hostescalations, serviceescalations).
Here is a simplified example of a host configuration in Exported Resources:
```puppet
$puppet_conf_directory = '/etc/nagios3/conf.puppet.d'
$host_directory        = "$puppet_conf_directory/$::hosting"

@@nagios_host { "$::fqdn":
  tag     => 'nagios',
  address => $::ipaddress_private,
  alias   => $::hostname,
  use     => "generic-host",
  notify  => Service[nagios3],
  target  => "$host_directory/$::fqdn-host.cfg",
}
```
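Services follow the same pattern, and each poller then simply collects every exported object. A minimal sketch (the disk check command and thresholds below are only an illustration, not our real checks):

```puppet
# On every monitored node: export services with the same dynamic target
@@nagios_service { "check_disk_${::fqdn}":
  tag                 => 'nagios',
  host_name           => $::fqdn,
  service_description => 'Disk space',
  check_command       => 'check_all_disks!20%!10%',
  use                 => 'generic-service',
  notify              => Service[nagios3],
  target              => "${host_directory}/${::fqdn}-service.cfg",
}

# On every Nagios poller: collect all exported objects; the target set
# at export time decides in which datacenter directory each file lands
Nagios_host <<| tag == 'nagios' |>>
Nagios_service <<| tag == 'nagios' |>>
```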
All resources common to every poller (contacts, contactgroups,
commands, timeperiods, etc.) are generated in one common directory
that all Nagios pollers read (i.e. ‘/etc/nagios3/conf.puppet.d/common’).
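For example, a shared timeperiod can be generated directly into that common directory (a sketch using the stock ‘24x7’ definition; these resources are declared on each poller, not exported):

```puppet
nagios_timeperiod { '24x7':
  alias     => '24 Hours A Day, 7 Days A Week',
  sunday    => '00:00-24:00',
  monday    => '00:00-24:00',
  tuesday   => '00:00-24:00',
  wednesday => '00:00-24:00',
  thursday  => '00:00-24:00',
  friday    => '00:00-24:00',
  saturday  => '00:00-24:00',
  target    => '/etc/nagios3/conf.puppet.d/common/timeperiods.cfg',
}
```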
Finally, in nagios.cfg, each poller reads the right directories for its datacenter:
```
# i.e. for nagios1:
cfg_dir=/etc/nagios3/conf.puppet.d/common
cfg_dir=/etc/nagios3/conf.puppet.d/hosting1

# for nagios2:
cfg_dir=/etc/nagios3/conf.puppet.d/common
cfg_dir=/etc/nagios3/conf.puppet.d/hosting2
```
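Since each poller carries the $::hosting fact too, these lines can be generated from a small ERB template (a sketch; the excerpt below is illustrative, the real nagios.cfg contains much more):

```erb
# nagios.cfg.erb (excerpt): the poller’s own $::hosting fact selects its datacenter
cfg_dir=/etc/nagios3/conf.puppet.d/common
cfg_dir=/etc/nagios3/conf.puppet.d/<%= @hosting %>
```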
I decided not to use per-datacenter tags in exported resources:
this lets me have exactly the same configuration files on every Nagios poller in “/etc/nagios3/conf.puppet.d”; only nagios.cfg changes between pollers.
In case of a problem on one of my pollers, another one can very simply take over the monitoring of that datacenter: I just add one more directory to source in its nagios.cfg.
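For example, if the hosting2 poller is down, nagios1 can take over that datacenter with a single extra line:

```
# nagios1 taking over hosting2:
cfg_dir=/etc/nagios3/conf.puppet.d/common
cfg_dir=/etc/nagios3/conf.puppet.d/hosting1
cfg_dir=/etc/nagios3/conf.puppet.d/hosting2
```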
With that configuration I get very simple distributed monitoring, thanks to Puppet once again.
I will explain in a future blog post how to aggregate the views with Livestatus and Check_MK Multisite.