Rule-based Configuration (ncgx)

ncgx is a rule-based engine that generates nagios-core compatible object configurations. It’s similar to check_mk cmk, but supports generating configuration from a Python API (via plugins) as well as from a set of manually defined configs. It’s built around the following basic concepts: metrics, hosts and checks.

Hosts define the hostnames of the machines that will be monitored and can be grouped via arbitrary tags. The following sample definition introduces two hosts (host1.domain, host2.domain) and a few sample tags:

 hosts = { 
   'host1.domain' : ['tag1', 'tag2', .. ],
   'host2.domain' : ['tag1', 'tag3', .. ],
   }

Metrics are templates that define how Nagios plugins will be executed to take a particular measurement. The following rule defines a metric called LHCb feed, which will execute the check_http Nagios plugin to determine the accessibility of a URL specified via arguments. The plugin will be executed once an hour (every 60 minutes) and, upon failure, every 5 minutes.

 metrics = {
   "LHCb feed" : {
        "command" : "/usr/bin/check_http",
        "args" : {
            "-u" : "/some/path/file.xml", 
            "-H" : "somehost.cern.ch", 
            "-s" : '"LHCb Topology Information"', # string to lookup in xml 
            "--onredirect" : "follow"
        },
        "interval" : 60,
        "retry_interval" : 5
   }
 }

You can define any command that follows the Nagios plugins standard in this way. For a list of existing plugins and their arguments, please refer to Plugins overview. Definition of the supported directives (e.g. interval, retry_interval) can be found in the section Metric Definitions. ncgx comes with a set of pre-defined metrics for several common plugins, which can be found in /etc/ncgx/metrics.d/.
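
As a rough illustration of how a metric definition maps onto a plugin command line, the following sketch joins the command and its arguments into a single string. The helper build_command is hypothetical and is not part of ncgx; the command ncgx actually generates may differ (the --simulate option shows the real one).

```python
# Illustrative sketch only; build_command is a hypothetical helper, not part of ncgx.
metric = {
    "command": "/usr/bin/check_http",
    "args": {
        "-u": "/some/path/file.xml",
        "-H": "somehost.cern.ch",
        "-s": '"LHCb Topology Information"',
        "--onredirect": "follow",
    },
    "interval": 60,
    "retry_interval": 5,
}


def build_command(metric):
    """Join the plugin path and its option/value pairs into one command string."""
    parts = [metric["command"]]
    for option, value in metric.get("args", {}).items():
        parts.append(option)
        parts.append(str(value))
    return " ".join(parts)


print(build_command(metric))
```

The scheduling directives (interval, retry_interval) do not appear in the command; they only control when the check is scheduled.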

Checks connect hosts and metrics: they determine which metrics will be executed on which hosts. The following rule will run the LHCb feed metric against all hosts tagged with tag1 as well as directly on localhost:

  checks = [
       ['tag1', 'LHCb feed'],
       ['localhost', 'LHCb feed'],
  ]
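
To make the effect of such checks concrete, here is a minimal sketch of how they could be expanded into host/metric pairs. The function expand_checks is hypothetical, and the rule that a selector counts as a tag when at least one host carries it (and as a hostname otherwise) is an assumption, not necessarily how ncgx resolves it.

```python
# Illustrative sketch only; expand_checks is a hypothetical helper, not part of ncgx.
hosts = {
    'host1.domain': ['tag1', 'tag2'],
    'host2.domain': ['tag1', 'tag3'],
}

checks = [
    ['tag1', 'LHCb feed'],
    ['localhost', 'LHCb feed'],
]


def expand_checks(hosts, checks):
    """Return a sorted list of (hostname, metric) pairs selected by the checks."""
    pairs = set()
    for check in checks:
        selector, metric = check[0], check[1]
        tagged = [name for name, tags in hosts.items() if selector in tags]
        if tagged:
            # Assumption: the selector is a tag carried by at least one host.
            pairs.update((name, metric) for name in tagged)
        else:
            # Assumption: anything that is not a known tag is a hostname.
            pairs.add((selector, metric))
    return sorted(pairs)


for host, metric in expand_checks(hosts, checks):
    print(host, '->', metric)
```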

ncgx plugins can be used to add the rules described above through code, using the Python API. The corresponding plugin for the above rules would look like this:

 from ncgx.inventory import Hosts, Metrics, Checks 
 
 # entry point of any module is a run function with any number of arguments 
 # (see Configuration for details)
 def run(args):
     h = Hosts()
     h.add('host1.domain', tags=['tag1', 'tag2']) # tag host1.domain with tag1, tag2
     h.add('host2.domain', tags=['tag1', 'tag3'])
     h.serialize(fname='/tmp/hosts.cfg') # write the corresponding rules to file /tmp/hosts.cfg
     
     m = Metrics()
     m.add('LHCb feed', { "command" : "/usr/bin/check_http", "args" : { ... } })
     m.serialize(fname='/tmp/metrics.cfg')
     
     c = Checks()
     c.add('LHCb feed', tags=['tag1',], hosts=['localhost',])
     c.serialize(fname='/tmp/check.cfg')

Invocation

ncgx reads the rules (metrics, checks and hosts) from a set of cfg files that are usually stored in /etc/ncgx/conf.d. Any number of cfg files is supported and each can contain any set of rules (metrics, hosts, etc.). cfg files are regular Python scripts, so any syntax supported by Python can be used (note that functions will not be evaluated). In addition, any number of ncgx plugins can be executed to generate additional cfg files. Plugins are usually stored in /usr/lib/ncgx/x_plugins and are configured for execution in the main ncgx configuration file (/etc/ncgx/ncgx.cfg; for details see the Configuration section).

Once invoked from the command line, ncgx first finds and executes all the configured plugins. Afterwards it reads all the cfg files (both generated and manually added) and compiles them into an inventory. The inventory is serialized into a JSON-formatted file, which usually resides in /var/cache/ncgx. Once the inventory is generated, post-processing scripts are executed, and finally the backend template engine is called to process the inventory and generate the nagios-core compatible configuration.
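
The compile stage can be pictured with the sketch below, which evaluates cfg content as Python, merges the rules it defines and serializes a per-host inventory to JSON. The function names (load_rules, compile_inventory) and the inventory layout are assumptions made for illustration; the real inventory format is not described here and may look different.

```python
import json

# Illustrative sketch only; the names and the inventory layout are assumptions.
CFG_SNIPPETS = [
    "hosts = {'host1.domain': ['tag1']}\n"
    "metrics = {'LHCb feed': {'command': '/usr/bin/check_http', 'interval': 60}}\n",
    "checks = [['tag1', 'LHCb feed']]\n",
]


def load_rules(snippets):
    """Execute each cfg snippet as Python and merge the rules it defines."""
    rules = {"hosts": {}, "metrics": {}, "checks": []}
    for source in snippets:
        namespace = {}
        exec(source, {}, namespace)  # cfg files are regular Python scripts
        rules["hosts"].update(namespace.get("hosts", {}))
        rules["metrics"].update(namespace.get("metrics", {}))
        rules["checks"].extend(namespace.get("checks", []))
    return rules


def compile_inventory(rules):
    """Build a per-host mapping of the metrics selected by tag or hostname."""
    inventory = {}
    for check in rules["checks"]:
        selector, metric = check[0], check[1]
        for name, tags in rules["hosts"].items():
            if selector == name or selector in tags:
                entry = inventory.setdefault(name, {"tags": tags, "metrics": {}})
                entry["metrics"][metric] = rules["metrics"].get(metric, {})
    return inventory


rules = load_rules(CFG_SNIPPETS)
inventory = compile_inventory(rules)
print(json.dumps(inventory, indent=2, sort_keys=True))
```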

Command Line Options

usage: ncgx [-h] [--version] [-c CONFIG] [--nagios-path NAGIOS_PATH] [-d] [-q]
            [-l] [-P] [--skip-plugins] [-I] [-T INVENTORY]
            [--inventory-path INVENTORY_PATH] [-t TAG] [-N] [-H HOST]
            [-m METRIC] [--explain] [-S] [-s] [--no-host-def]
            [--skip-host-def SKIP_HOSTS] [--ipv6]

ncgx - nagios-core configuration utility for synthetic/functional testing

  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -c CONFIG, --config CONFIG
                        Specify path of the configuration files (defaults to
                        /etc/ncgx)
  --nagios-path NAGIOS_PATH
                        Path where generated Nagios configuration should be
                        stored (defaults to /etc/nagios/conf.d)
  -d, --debug           Specify debugging mode
  -q, --quiet           Redirects to syslog and restricts output to errors
  -l, --log             Adds standard logging headers to stdout
  -P, --plugins-only    Calls all configured plugins to generate local config
  --skip-plugins        Skips executing plugins
  -I, --inventory-only  Calls all configured plugins, compiles and generates
                        inventory
  -T INVENTORY, --inventory INVENTORY
                        Generates Nagios-core configuration from an existing
                        inventory file
  --inventory-path INVENTORY_PATH
                        Specify path where inventory should be stored
  -t TAG, --tag TAG     List all hosts for given tag
  -N, --nagios          Dumps Nagios configuration for given host (needs
                        --host)
  -H HOST, --host HOST  Specify host (implies --nagios)
  -m METRIC, --metric METRIC
                        Specify metric (implies --simulate)
  --explain             Shows final metric definition for given metric
  -S, --simulate        Shows command Nagios will execute for given host and
                        metric (needs --host and --metric)
  -s, --safe            Attempts to validate generated configuration and if it
                        fails, keeps the existing one (requires valid
                        nagios.cfg under generic templates)
  --no-host-def         Skip generation of the host definitions
  --skip-host-def SKIP_HOSTS
                        Skip generation of host defs for a given list of hosts
                        (comma separated)
  --ipv6                Enable IPv6 configuration tweaks

Configuration

ncgx has a single configuration file usually stored in /etc/ncgx/ncgx.cfg. The following options are recognized:

# Determines which plugins should be executed in the plugins stage.
# Plugins are run in a random sequence; arguments are passed as key/value pairs
# to function run(arg1, arg2, ..)
PLUGINS = { "plugin_module1" : { "arg1" : "val1", "arg2" : "val2" },
            "plugin_module2" : { "arg1" : "val1" } }

# Defines what inventory post-processing scripts will be executed
# available scripts can be found in the ncgx python directory
SCRIPTS = [ 'localhost', 'unique_tag']

# Determines if host and service groups should be generated (see Host and Service Groups section)
DEFINE_HOSTGROUPS = True
DEFINE_SERVICEGROUPS = True

# Determines if contacts will be generated (see Contacts and Notifications) 
DEFINE_CONTACTS = True

# Auto-define passive metrics (see Metric definitions for details)
AUTO_PASSIVE = True

# Any other remaining attributes are auto-propagated to the metric templates 
# (see Metric definitions for details)
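
The way PLUGINS entries translate into calls of each plugin's run function can be sketched as follows. This is only an illustration under assumptions: run_plugins is a hypothetical helper, and a stand-in module is registered in sys.modules so that the example runs without ncgx installed.

```python
import importlib
import sys
import types

# Illustrative sketch only; run_plugins and the demo module are hypothetical.
PLUGINS = {"plugin_module1": {"arg1": "val1", "arg2": "val2"}}


def _demo_run(arg1, arg2):
    """Stand-in for a plugin's run() entry point."""
    return "called with %s and %s" % (arg1, arg2)


# Register a fake module so the example is self-contained; a real plugin
# would be an importable module, e.g. one stored under /usr/lib/ncgx/x_plugins.
demo = types.ModuleType("plugin_module1")
demo.run = _demo_run
sys.modules["plugin_module1"] = demo


def run_plugins(plugins):
    """Import each configured module and call run() with its key/value arguments."""
    results = {}
    for module_name, kwargs in plugins.items():
        module = importlib.import_module(module_name)
        results[module_name] = module.run(**kwargs)
    return results


print(run_plugins(PLUGINS))
```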

Host and Service Groups

In addition to tags, ncgx supports generation of host and service groups, which can be used to group hosts and metrics in different ways in order to provide specific visualizations in the web interfaces. The following sample rules create hostgroup1 from already defined hosts and a service group that will contain any metric whose name contains the string “LHCb”:

host_groups = [
          ["hostgroup1", ['host1.domain', 'host2.domain']], ...
]
          
service_groups = [
          ["sg1", ["LHCb"]],
]

The same via plugin API:

from ncgx.inventory import Groups

hg = Groups("host_groups")
hg.add("hostgroup1", "host1.domain")
hg.add("hostgroup1", "host2.domain")

sg = Groups("service_groups")
sg.add("MJF", "org.lhcb.WN-mjf")
sg.serialize()
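
How a service group pattern selects metrics can be illustrated with the short sketch below. The helper members_of is hypothetical, and plain case-sensitive substring matching is an assumption about how patterns are compared.

```python
# Illustrative sketch only; members_of is a hypothetical helper, not part of ncgx.
service_groups = [
    ["sg1", ["LHCb"]],
]

metric_names = ["LHCb feed", "CMS feed", "org.lhcb.WN-mjf"]


def members_of(group_patterns, names):
    """Return the names that contain at least one of the group's patterns."""
    # Assumption: matching is a case-sensitive substring test.
    return [n for n in names if any(p in n for p in group_patterns)]


for group, patterns in service_groups:
    print(group, members_of(patterns, metric_names))
```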

Metric Definitions

Since defining metrics can be quite tedious, there are many different advanced features to help with complex definitions. Metric definitions support one-level inheritance, so an initial template can be defined and later extended, e.g.

# The following definition shows how LHCb feed metric can be defined based on 
# check_http template just by adding additional command line arguments
 metrics = {
   "check_http" : {
        "command" : "/opt/omd/versions/default/lib/nagios/plugins/check_http",
        "args" : {
        },
        "interval" : 60,
        "retry_interval" : 5,
        "max_check_attempts" : 3,
        "ocsp" : 0,
        "nagios_command" : "ncgx_check",
   },
   "LHCb feed" : {
        "extends" : "check_http",
        "args" : {
            "-u" : "/some/path/file.xml", 
            "-H" : "somehost.cern.ch", 
            "-s" : '"LHCb Topology Information"', # string to lookup in xml 
            "--onredirect" : "follow",
        }
   }
}
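
One way to picture one-level inheritance is the sketch below, which computes the final definition of a metric from its template. The function resolve_metric is hypothetical, and the merge rules (child args are merged into the parent's args, other child directives override the parent's) are assumptions; the --explain option shows the definition ncgx actually produces.

```python
import copy

# Illustrative sketch only; resolve_metric is hypothetical and the merge rules are assumed.
metrics = {
    "check_http": {
        "command": "/opt/omd/versions/default/lib/nagios/plugins/check_http",
        "args": {},
        "interval": 60,
        "retry_interval": 5,
    },
    "LHCb feed": {
        "extends": "check_http",
        "args": {"-u": "/some/path/file.xml", "-H": "somehost.cern.ch"},
    },
}


def resolve_metric(name, metrics):
    """Return the final definition of a metric after one level of inheritance."""
    definition = copy.deepcopy(metrics[name])
    parent_name = definition.pop("extends", None)
    if parent_name is None:
        return definition
    resolved = copy.deepcopy(metrics[parent_name])
    resolved.pop("extends", None)  # only one level of inheritance is followed
    # Assumption: child args are merged into the parent's args.
    resolved.setdefault("args", {}).update(definition.pop("args", {}))
    # Assumption: every other child directive overrides the parent's value.
    resolved.update(definition)
    return resolved


final = resolve_metric("LHCb feed", metrics)
print(final["command"], final["interval"], final["args"])
```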

Metrics can also be defined directly via inheritance in checks, which makes it possible to tune metric definition per host and per metric, e.g.

# Per-check overrides: the optional third element extends an existing metric
checks = [
    # runs CMS feed with given args on all hosts tagged with tag1
    [ "tag1", "CMS feed", { "extends" : "check_http",
                            "args" : { "-u": "/some/path", ... } } ],
    # but also runs CMS feed with host-specific args on host1
    [ "host1", "CMS feed", { "extends" : "check_http",
                             "args" : { "-u": "/some/host1/specific/path", .. } } ],
]

The same can be done in a plugin via API, e.g.

from ncgx.inventory import Checks

checks = Checks()
checks.add("CMS feed", "tag1", params={ "extends" : "check_http",
                                   "args" : { "-u": "/some/path", ... } })

So far we have only seen active metrics, i.e. metrics that run an actual executable. There are cases, however, where a single executable sets the state of multiple metrics; these are called passive metrics. A sample definition of a passive metric is:

   "p_metric" : {
      "passive" : True,
   },

Since passive metrics usually require an underlying active metric that will set their state, ncgx allows dependencies between metrics to be defined, e.g.

   "p_metric" : {
      "passive" : True,
      "depends" : "metric1 metric2 metric3" # if p_metric is in checks, this clause will 
                                            # add metrics 1-3 to the 
                                            # inventory for the same host as well
   },

In order to avoid numerous passive definitions, ncgx can auto-define passive metrics via AUTO_PASSIVE = True clause in the ncgx main configuration. With auto-passive on, any metric not defined is considered a passive metric.
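
The interplay of the depends clause and auto-passive can be sketched as follows. The function expand_metrics is hypothetical, and raising an error for an undefined metric when auto-passive is off is an assumption; ncgx may handle that case differently.

```python
# Illustrative sketch only; expand_metrics is hypothetical, not part of ncgx.
AUTO_PASSIVE = True

metrics = {
    "metric1": {"command": "/usr/bin/check_http"},
    "p_metric": {"passive": True, "depends": "metric1 metric2"},
}


def expand_metrics(requested, metrics, auto_passive):
    """Return {metric name: definition} for a host, following "depends"."""
    selected = {}
    pending = list(requested)
    while pending:
        name = pending.pop(0)
        if name in selected:
            continue
        if name in metrics:
            definition = metrics[name]
        elif auto_passive:
            # With auto-passive on, an undefined metric is treated as passive.
            definition = {"passive": True}
        else:
            # Assumption: without auto-passive an undefined metric is an error.
            raise KeyError("metric %r is not defined" % name)
        selected[name] = definition
        pending.extend(definition.get("depends", "").split())
    return selected


selected = expand_metrics(["p_metric"], metrics, AUTO_PASSIVE)
print(sorted(selected))
```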

Variables are supported in metric definitions via the :: clause, e.g.

   "metric" : {
      "args" : {
          "-x" : "::X509_USER_PROXY_PROD" # attribute X509_USER_PROXY_PROD will be
                                          # looked up in /etc/ncgx/ncgx.cfg and
                                          # its value will be used here
      }
   }
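
The lookup described above can be sketched as a simple substitution over the args of a metric. The helper substitute is hypothetical, the configuration value is made up, and leaving an unknown attribute untouched is an assumption.

```python
# Illustrative sketch only; substitute is hypothetical and the config value is made up.
config = {"X509_USER_PROXY_PROD": "/path/to/proxy"}

metric = {"args": {"-x": "::X509_USER_PROXY_PROD", "-t": "60"}}


def substitute(args, config):
    """Replace values of the form ::NAME with the matching configuration attribute."""
    resolved = {}
    for option, value in args.items():
        if isinstance(value, str) and value.startswith("::"):
            # Assumption: an unknown attribute is left untouched.
            resolved[option] = config.get(value[2:], value)
        else:
            resolved[option] = value
    return resolved


print(substitute(metric["args"], config))
```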

It’s also possible to use native nagios-core command macros in place, e.g.

"--prev-status" : "$LASTSERVICESTATEID$"   # see Nagios documentation on existing macros

Finally, the following additional metric directives are recognized.

  • interval - This directive is used to define the number of “time units” to wait before scheduling the next “regular” check of the service. “Regular” checks are those that occur when the service is in an OK state or when the service is in a non-OK state, but has already been rechecked max_check_attempts number of times.
  • retry_interval - This directive is used to define the number of “time units” to wait before scheduling a re-check of the service. Services are rescheduled at the retry interval when they have changed to a non-OK state. Once the service has been retried max_check_attempts times without a change in its status, it will revert to being scheduled at its “normal” rate as defined by the interval value.
  • max_check_attempts - This directive is used to define the number of times that Nagios will retry the service check command if it returns any state other than an OK state. Setting this value to 1 will cause Nagios to generate an alert without retrying the service check again.
  • docurl - This directive is used to define an optional URL that can be used to provide more information about the service.
  • doc - This directive is used to define an optional string of notes pertaining to the service. If you specify a note here, you will see it in the extended information CGI (when you are viewing information about the specified service).
  • ocsp - This directive determines whether or not checks will be “obsessed” over using the ocsp_command (OCSP stands for obsessive compulsive service processor, see the Nagios documentation). It is used primarily to stream metric results to the nagios stream.
  • nagios_command - This directive can be used to directly indicate which Nagios command should be used.

For detailed documentation on all the directives supported by nagios-core, please check the Nagios documentation.

Contacts and Notifications

Contacts and notifications can be used to alert a site when a particular metric fails. Contacts need to be enabled in the main configuration file and can be defined in a cfg file as follows:

contacts = [
   [ "contact@domain.ext", "<metric_pattern>", "<host_pattern> or <host_group>", 
             { "option1" : "val1", .. } ],
]
# this will notify feed_test@cern of any issues with metrics containing the word "feed" on all hosts in the "cern.ch" domain;
# in addition, test@cern will be notified of any issues on all hosts in the hostgroup T2_US_TEST
contacts = [
   [ "feed_test@cern", "feed", "cern.ch", ],
   [ "test@cern", "", "T2_US_TEST", ],
]

The following contact options are supported:

  • notification_options - This directive is used to determine when notifications for the service should be sent out. Valid options are a combination of one or more of the following: w = send notifications on a WARNING state, u = send notifications on an UNKNOWN state, c = send notifications on a CRITICAL state, r = send notifications on recoveries (OK state), f = send notifications when the service starts and stops flapping, and s = send notifications when scheduled downtime starts and ends. If you specify n (none) as an option, no service notifications will be sent out. If you do not specify any notification options, Nagios will assume that you want notifications to be sent out for all possible states. Example: If you specify w,r in this field, notifications will only be sent out when the service goes into a WARNING state and when it recovers from a WARNING state. Default is c (critical states only).
  • notification_interval - This directive is used to define the number of “time units” to wait before re-notifying a contact that this service is still in a non-OK state. Unless you’ve changed the interval_length directive from the default value of 60, this number will mean minutes. If you set this value to 0, Nagios will not re-notify contacts about problems for this service - only one problem notification will be sent out. Default is 60 minutes.
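
The contact rules shown earlier in this section can be read as a filter over host/metric pairs, as in the sketch below. The function contacts_for is hypothetical; plain substring matching, an empty pattern matching everything, and the dictionary form of host_groups are all assumptions made for illustration.

```python
# Illustrative sketch only; contacts_for is hypothetical and the matching rules are assumed.
contacts = [
    ["feed_test@cern", "feed", "cern.ch"],
    ["test@cern", "", "T2_US_TEST"],
]

# Assumption: host groups are held as {group name: [hostnames]} for easy lookup.
host_groups = {"T2_US_TEST": ["host1.domain"]}


def contacts_for(host, metric, contacts, host_groups):
    """Return the contacts to notify about a metric on a host."""
    matched = []
    for entry in contacts:
        address, metric_pattern, host_pattern = entry[0], entry[1], entry[2]
        # Assumption: patterns are plain substrings; an empty pattern matches anything.
        metric_ok = metric_pattern in metric
        host_ok = host_pattern in host or host in host_groups.get(host_pattern, [])
        if metric_ok and host_ok:
            matched.append(address)
    return matched


print(contacts_for("somehost.cern.ch", "LHCb feed", contacts, host_groups))
print(contacts_for("host1.domain", "CMS feed", contacts, host_groups))
```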

Template Engine

ncgx has a simple template engine that is used internally to generate the actual nagios-core configuration files. The template engine can be tweaked if needed. Templates are usually stored in /etc/ncgx/templates, and the engine supports functions and variables. Variables are automatically evaluated from the inventory. Functions are called from the template function directory (templates/functions in the ncgx python directory). A sample host object definition can be defined via a template:

define host {
   use             ncgx-generic-host
   host_name       ::hostname  # maps to real hostname as defined in the inventory
   alias           ::hostname
   address         ::ip  # maps to real IP address as defined in the inventory
   check_command   ncgx_check_host_alive
   ::t_function(host_groups) # calls templates function host_groups to generate host_groups definition
}

Templates can be used to make global changes or tune behaviour for particular directives.

Any template function defined in the ncgx python directory can be called, providing it accepts the following arguments:

  • line - string that will contain the current line processed in the template
  • base_config - object that stores variables from the ncgx main configuration file
  • context - inventory object for the current host being processed

For example, a function generating the hostgroups directive:
def template_function(line, base_config, context):
    if "host_groups" in context.keys() and context['host_groups']:
        return "hostgroups       "+",".join(context['host_groups'])
    else:
        return " "
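
To show how variables and template functions fit together, the sketch below renders a small host template against a context. It is not ncgx's engine: render and the FUNCTIONS registry are hypothetical, the context values are made up, and leaving unknown variables as written is an assumption.

```python
import re

# Illustrative sketch only; render and FUNCTIONS are hypothetical, not ncgx's engine.
TEMPLATE = """define host {
   host_name       ::hostname
   address         ::ip
   ::t_function(host_groups)
}"""


def host_groups(line, base_config, context):
    """Template function following the (line, base_config, context) signature."""
    if context.get("host_groups"):
        return "hostgroups      " + ",".join(context["host_groups"])
    return " "


FUNCTIONS = {"host_groups": host_groups}


def render(template, base_config, context):
    """Expand ::t_function(name) calls and ::variable references line by line."""
    output = []
    for line in template.splitlines():
        call = re.search(r"::t_function\((\w+)\)", line)
        if call:
            indent = line[: len(line) - len(line.lstrip())]
            output.append(indent + FUNCTIONS[call.group(1)](line, base_config, context))
        else:
            # Assumption: an unknown variable is left as written.
            output.append(re.sub(
                r"::(\w+)",
                lambda m: str(context.get(m.group(1), m.group(0))),
                line,
            ))
    return "\n".join(output)


# Made-up context values for illustration.
context = {"hostname": "host1.domain", "ip": "192.0.2.10", "host_groups": ["hostgroup1"]}
print(render(TEMPLATE, {}, context))
```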

Finally, any files stored under /etc/ncgx/templates/generic will be simply copied over to the target nagios-core directory.