Icinga2 - API and Passive Checks

The Icinga2 configuration options are rich and give you a lot of ways to reuse what you already have. They serve well, but once your monitoring becomes big enough they just don't cut it anymore. Reducing the load on the monitoring hosts and on the servers you check becomes a high priority. One way to solve this is to use the API and create passive services where applicable.

At the time of writing there isn't enough information for people who are trying to learn what passive checks are and how to implement them. I'm writing this in the hope that it will make it easier for people who want to know how to get started.







Passive Checks


In essence these are nothing more than services that don't trigger any commands. Instead they rely on having their states, results and performance data injected from another source. The best way to do this right now is by using the Icinga API.

The most common use case is when you're able to gather information from a large number of servers or services by running a single script. This script can either be run by Icinga as a normal check command which returns a combined state for all the services, or it can be triggered in any other way.

The plugin gathers data for each service it checks and then injects it directly in an existing service using the Icinga API.

A passive service would either use the "dummy" command with "enable_active_checks" set to false, or keep active checks enabled and rely on the dummy command's default text output, which should only be served if the API injection is delayed and exceeds the check interval of the service.


To avoid the frustration of manually creating each service before importing its state, you can use the API for spawning services as well. They can be assigned to any existing host or, if you prefer, the host(s) they are assigned to can also be created using the API.

I don't recommend mixing API and manual configuration on the same host objects, unless you have implemented good logic for discovering and getting rid of obsolete (zombie) services that no longer exist and therefore don't receive status updates. One way to do this is to use the API to filter and remove services with the 'dummy' default output, which should be present when the service no longer receives external injections.

If your host only contains API-generated services it can be easily removed together with everything attached to it and then rebuilt the next time your script is triggered.



Icinga API

The Icinga API is a powerful tool which provides access to the entire Icinga configuration, objects and states. Most of the examples found online use the Linux shell command curl. If shell scripts suit you, that is the simplest way of interacting with it. However, I'll provide examples for both Python and curl that do the same thing. Python is more flexible and it also runs on Windows, which unfortunately is required sometimes.

Examples for other languages can be found here

Creating a host

curl


python
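A minimal sketch of creating a host using only Python's standard library. The endpoint `https://localhost:5665/v1`, the `root`/`icinga` credentials, the host name and the address are placeholders, and `create_host_request` is a hypothetical helper name. The function only builds the request, so you can inspect it before sending anything.

```python
import base64
import json
import urllib.parse
import urllib.request

API_URL = "https://localhost:5665/v1"  # assumption: API listening on the default port
API_AUTH = ("root", "icinga")          # assumption: placeholder ApiUser credentials


def create_host_request(name, address):
    """Build (but do not send) a PUT request that creates a host object."""
    token = base64.b64encode(":".join(API_AUTH).encode()).decode()
    payload = {"attrs": {"address": address, "check_command": "hostalive"}}
    return urllib.request.Request(
        f"{API_URL}/objects/hosts/{urllib.parse.quote(name, safe='')}",
        data=json.dumps(payload).encode(),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="PUT",
    )


# Sending it needs a reachable Icinga 2 instance and a certificate Python trusts:
# with urllib.request.urlopen(create_host_request("api-host-01", "192.0.2.10")) as resp:
#     print(json.load(resp))
```

If your Icinga instance uses its own CA, pass an `ssl` context that trusts it to `urlopen` via the `context` argument.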


Creating a Service

curl


python
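A sketch of creating a passive service backed by the "dummy" command. The endpoint, credentials, object names, the 300 second interval and the fallback state/text are all assumptions picked for illustration, and `create_passive_service_request` is a hypothetical helper name. Active checks stay enabled here, so the dummy text only shows up when no injected result arrives in time.

```python
import base64
import json
import urllib.parse
import urllib.request

API_URL = "https://localhost:5665/v1"  # assumption: API listening on the default port
API_AUTH = ("root", "icinga")          # assumption: placeholder ApiUser credentials


def create_passive_service_request(host, service, check_interval=300):
    """Build (but do not send) a PUT request for a dummy-backed passive service."""
    token = base64.b64encode(":".join(API_AUTH).encode()).decode()
    attrs = {
        "check_command": "dummy",
        "check_interval": check_interval,
        # Fallback result, served only if no external result was injected in time.
        "vars.dummy_state": 3,
        "vars.dummy_text": "No passive check result received",
    }
    # Service objects are addressed as "<host>!<service>".
    object_name = urllib.parse.quote(f"{host}!{service}", safe="!")
    return urllib.request.Request(
        f"{API_URL}/objects/services/{object_name}",
        data=json.dumps({"attrs": attrs}).encode(),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="PUT",
    )
```

Setting `"enable_active_checks": False` in `attrs` instead gives you the other variant, where the dummy command never runs at all.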


Injecting the state of a service

curl


python
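A sketch of injecting a result through the `process-check-result` action. The endpoint, credentials and object names are placeholders, and `process_check_result_request` is a hypothetical helper name. The service is selected with a `type` plus `filter` pair; exit status follows the usual plugin codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import base64
import json
import urllib.request

API_URL = "https://localhost:5665/v1"  # assumption: API listening on the default port
API_AUTH = ("root", "icinga")          # assumption: placeholder ApiUser credentials


def process_check_result_request(host, service, exit_status, plugin_output):
    """Build (but do not send) a POST request that injects a service check result."""
    token = base64.b64encode(":".join(API_AUTH).encode()).decode()
    payload = {
        "type": "Service",
        "filter": f'host.name=="{host}" && service.name=="{service}"',
        "exit_status": exit_status,
        "plugin_output": plugin_output,
    }
    return urllib.request.Request(
        f"{API_URL}/actions/process-check-result",
        data=json.dumps(payload).encode(),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Sending it needs a reachable Icinga 2 instance and a certificate Python trusts:
# request = process_check_result_request("api-host-01", "disk-usage", 0, "OK - / is 40% full")
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```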


Importing PerfData variables is easily done with an array. Otherwise Python doesn't parse the required quotes correctly, which leads to Icinga not recognizing the data. Here is an example of how that's done:
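A minimal sketch of the payload only, assuming the same `process-check-result` action. The object names, the message and the values are made up, and `check_result_payload` is a hypothetical helper name. Each performance data entry is one string in a Python list, and `json.dumps` takes care of the quoting.

```python
import json


def check_result_payload(host, service, exit_status, plugin_output, performance_data):
    """Return the JSON body for process-check-result, with perfdata as an array."""
    return {
        "type": "Service",
        "filter": f'host.name=="{host}" && service.name=="{service}"',
        "exit_status": exit_status,
        "plugin_output": plugin_output,
        # One string per value, never a single hand-quoted blob.
        "performance_data": list(performance_data),
    }


payload = check_result_payload(
    "api-host-01",
    "disk-usage",
    1,
    "WARNING - / is 85% full",
    ["disk_space_percent=85%;80;90;0;100", "disk_space_used=170GB;160;180;0;200"],
)
body = json.dumps(payload)  # this string is what goes into the POST request
```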


Deleting a host or service and everything attached to it recursively

curl


python
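A sketch of a cascading delete. The endpoint, credentials and object names are placeholders and `delete_request` is a hypothetical helper name. The `cascade=1` parameter is what removes dependent objects (for a host: its services) along with the object itself.

```python
import base64
import urllib.parse
import urllib.request

API_URL = "https://localhost:5665/v1"  # assumption: API listening on the default port
API_AUTH = ("root", "icinga")          # assumption: placeholder ApiUser credentials


def delete_request(object_type, name):
    """Build (but do not send) a cascading DELETE request.

    object_type is "hosts" or "services"; services are named "<host>!<service>".
    """
    token = base64.b64encode(":".join(API_AUTH).encode()).decode()
    object_name = urllib.parse.quote(name, safe="!")
    return urllib.request.Request(
        f"{API_URL}/objects/{object_type}/{object_name}?cascade=1",
        headers={"Accept": "application/json", "Authorization": f"Basic {token}"},
        method="DELETE",
    )


# delete_request("hosts", "api-host-01")                 # host plus all its services
# delete_request("services", "api-host-01!disk-usage")   # a single service
```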


Getting the current state of a service

curl


python
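A sketch of querying a single service and pulling the state and output from the reply. The endpoint, credentials and object names are placeholders, and both function names are hypothetical. The parser assumes the usual reply shape, a `results` list whose entries carry an `attrs` dictionary.

```python
import base64
import urllib.parse
import urllib.request

API_URL = "https://localhost:5665/v1"  # assumption: API listening on the default port
API_AUTH = ("root", "icinga")          # assumption: placeholder ApiUser credentials


def service_state_request(host, service):
    """Build (but do not send) a GET request for one service's state and last result."""
    token = base64.b64encode(":".join(API_AUTH).encode()).decode()
    object_name = urllib.parse.quote(f"{host}!{service}", safe="!")
    return urllib.request.Request(
        f"{API_URL}/objects/services/{object_name}?attrs=state&attrs=last_check_result",
        headers={"Accept": "application/json", "Authorization": f"Basic {token}"},
        method="GET",
    )


def parse_service_state(response):
    """Return (state, output) from a decoded JSON reply."""
    attrs = response["results"][0]["attrs"]
    # last_check_result is null for a service that has never been checked.
    output = (attrs.get("last_check_result") or {}).get("output", "")
    return int(attrs["state"]), output
```

The state is a number: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.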




Automation and Autodiscovery

With the ability to create, destroy and update objects automatically, the Icinga API allows the implementation of flexible automation. A well thought out script can make services appear and disappear seamlessly as they are implemented or removed from your environment.

The autodiscovery is only limited by the complexity of the script you use.

Here I will show an example of listing all the services on a specific host, checking their current state and output, and deleting them if both the state and the output match the ones set by the dummy command.

Normally the dummy command should only give its own output if the external injection is delayed longer than the check_interval of the service.

python
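A sketch of that cleanup logic, kept separate from the HTTP transport. `DUMMY_STATE` and `DUMMY_TEXT` are assumptions and must match whatever your services use for `dummy_state` and `dummy_text`. `api_call(method, path)` is a hypothetical callable that sends the request to the API and returns the decoded JSON; both function names are made up as well.

```python
import urllib.parse

DUMMY_STATE = 3                                   # assumption: fallback state of the services
DUMMY_TEXT = "No passive check result received"   # assumption: fallback text of the services


def find_stale_services(results, dummy_state=DUMMY_STATE, dummy_text=DUMMY_TEXT):
    """Return the names of services whose state and output both match the dummy fallback."""
    stale = []
    for item in results:
        attrs = item["attrs"]
        output = (attrs.get("last_check_result") or {}).get("output", "")
        # Substring match, in case the dummy check prefixes the text with the state name.
        if int(attrs["state"]) == dummy_state and dummy_text in output:
            stale.append(item["name"])
    return stale


def remove_stale_services(host, api_call):
    """List all services of a host, delete the stale ones, return their names."""
    query = urllib.parse.quote(f'host.name=="{host}"', safe="")
    listing = api_call("GET", f"/objects/services?filter={query}")
    stale = find_stale_services(listing["results"])
    for name in stale:
        object_name = urllib.parse.quote(name, safe="!")
        api_call("DELETE", f"/objects/services/{object_name}?cascade=1")
    return stale
```

Keeping the transport out of these functions makes it easy to dry-run the cleanup first: pass an `api_call` that only logs DELETE requests and check the returned names before letting it loose.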




Performance Data

Every good monitoring setup goes hand in hand with a well organized graphical view of the event history. To achieve this, all your monitoring scripts should return properly structured performance data values. The usual monitoring script output contains a text message and a check status. For example (Python):
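A minimal sketch of such a script, assuming the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). The function names, the thresholds and the disk usage scenario are illustrative only.

```python
import sys


def check_disk(used_percent, warn=80, crit=90):
    """Return (exit_code, message) for a disk usage value, without performance data."""
    if used_percent >= crit:
        return 2, f"CRITICAL - disk usage is {used_percent}%"
    if used_percent >= warn:
        return 1, f"WARNING - disk usage is {used_percent}%"
    return 0, f"OK - disk usage is {used_percent}%"


def main(used_percent):
    """Print the text message and exit with the check status."""
    code, message = check_disk(used_percent)
    print(message)
    sys.exit(code)
```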


This output, however, contains no performance data. It can be used for basic checks, but if you ever need to dig into the history of what happened to the service in the past day/month/year it just isn't very useful.

In Icinga2 performance data is delivered as part of the printed output message. After the status text there needs to be a pipe symbol | and everything after it is treated by Icinga as performance data.

Here is an example of the richest way of providing performance data to the check result.
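A sketch of building such output in Python, assuming the usual `label=value[unit];warn;crit;min;max` layout. The helper names, the label and the numbers are made up for illustration.

```python
def format_perfdata(label, value, unit="", warn="", crit="", minimum="", maximum=""):
    """Format one performance data entry with all five fields."""
    return f"{label}={value}{unit};{warn};{crit};{minimum};{maximum}"


def plugin_output(message, perfdata_items):
    """Join the status text and the performance data with the pipe symbol."""
    return f"{message} | {' '.join(perfdata_items)}"


line = plugin_output(
    "WARNING - / is 85% full",
    [format_perfdata("disk_space_percent", 85, "%", 80, 90, 0, 100)],
)
print(line)
```

Several values are separated by spaces after the pipe, so a label that itself contains spaces has to be wrapped in single quotes.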


Everything before the '=' is the name of the value, for example 'disk_space_percent'.

The first field after the '=' is the current value, followed by ';' when further fields are given. That's the least amount of information required to receive performance data which can be interpreted by the monitoring. Additionally you can append the unit of the value to make sure the monitoring understands what the value represents: MB, GB, % etc.

The second and third fields are the predefined warning and critical thresholds for the value. If the value ever exceeds them, the service is expected to change its state accordingly. They are also separated by ';'.

The fourth and fifth fields can be useful in services which return a multitude of values as performance data. They set the minimum and maximum possible for this value. The last field is, unintuitively, not followed by a semicolon. With them, Icinga can render the current value as a nice pie chart embedded in the icingaweb2 interface.

For example, the free disk space check usually monitors all partitions on your machine. It returns a value for each of them, but if one exceeds the warning/critical threshold, the entire check changes its state. To help you find the issue as fast as possible, the pie charts are ordered so that the one with the worst state is always the first you see. Mouse over it and you will see enough information to know what you're dealing with.