What is Icinga?

Icinga is an open-source monitoring system that checks the availability of network resources, notifies users of problems that occur, and aggregates data for effective reporting. Icinga is scalable and extensible software and can monitor large, complex environments across multiple locations. The name Icinga means “it looks for”, or “it examines”, in Zulu.

What are the benefits of using the Icinga Monitoring Tool?

  • Identification of issues anywhere in the network

Monitoring an environment with Icinga helps with the fast identification of issues. When you monitor an entire environment, you can quickly see data from multiple servers and identify the root cause of an issue. Moreover, you’re often able to predict a problem before it happens based on historical data. For example, when you see the load of a web app gradually increasing, you can estimate how much time you have before you need to give it more resources or optimize it.

  • Better use of IT Resources

Monitoring tools like Icinga save time by notifying you in case any metrics go outside of the expected ranges, while your specialists can focus on higher-value tasks. It can even remediate some of the more straightforward issues automatically if it is configured to do so. This helps you focus, increase efficiency, and better distribute the effort in the team.
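The automatic remediation mentioned above is typically done by attaching an EventCommand object to a service, which Icinga2 runs whenever the service changes state. A minimal sketch — the restart script path and the host variable are hypothetical placeholders:

```
object EventCommand "restart-httpd" {
  // Hypothetical remediation script that restarts the web server
  // whenever the attached service changes state.
  command = CustomPluginDir + "/restart_httpd.sh"
}

apply Service "http" {
  check_command = "http"
  // Run the remediation script on state changes.
  event_command = "restart-httpd"
  assign where host.vars.role == "webserver"
}
```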

  • Historical and Baseline Data

Icinga2 stores a lot of time-based data, divided into two main branches: history and performance data. The history stores the state changes of all hosts and services, and any notifications or downtimes triggered for them; it is kept in the relational database of the Icinga2 master. The performance data is gathered from the commands executed for monitoring purposes; it is stored in a time series database and can easily be visualized with a tool like Grafana. It is essential to have a good overview of how resources are being utilized over time, how applications or hardware are behaving, and to predict issues before they happen.

How does Icinga work?

[Image: Icinga2 distributed monitoring roles]
Source: https://icinga.com/docs/icinga-2/latest/doc/images/distributed-monitoring/icinga2_distributed_monitoring_roles.png

Each node in an Icinga scenario has one of three roles: master, satellite, or agent. The main machine that collects the metrics is registered as a master.

If we have many servers to monitor – for example, machines in different private subnets with no direct connectivity to them – we can add so-called satellite instances that serve as a proxy for accessing the private subnets. For example, for each client we may have a satellite that relays the check execution from the master to each of the instances in the private subnet.

The machines being monitored – the so-called endpoints – are registered as agents. Icinga agents execute monitoring scripts and return the status of the script execution, as well as optional performance data, back to the master.

For more reliability, we can create two Icinga2 instances that are both registered as masters and placed in one zone, so that if there is a problem with one of them, the other can take over. The same is true for satellite instances: in a highly available setup, we can have two satellites for each zone. The master uses them in a round-robin fashion to schedule its command executions. This ensures connectivity from the master to the end hosts is uninterrupted even if something happens to one of the satellites.
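A highly available zone like the one described here is expressed in zones.conf by listing two endpoints in the same zone. A sketch with placeholder host names:

```
object Endpoint "master1.example.com" { }
object Endpoint "master2.example.com" { }

object Zone "master" {
  // Two endpoints in one zone form an HA pair.
  endpoints = [ "master1.example.com", "master2.example.com" ]
}

object Endpoint "satellite1.site1.example.com" { }
object Endpoint "satellite2.site1.example.com" { }

object Zone "site1" {
  // Two satellites for the zone; the master schedules
  // checks across them in a round-robin fashion.
  endpoints = [ "satellite1.site1.example.com", "satellite2.site1.example.com" ]
  parent = "master"
}
```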

Configuration

Once we have registered all the machines as master, satellite, or agent, and once we have a web interface (Icinga Web 2), we need to add the hosts and metrics. The file loading process for Icinga2 begins with the icinga2.conf file. It usually loads all files with the .conf extension recursively from the conf.d directory via the following definition:

include_recursive "conf.d"
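For context, a typical icinga2.conf combines several include styles; the exact set of includes may differ per installation:

```
// /etc/icinga2/icinga2.conf (typical layout; may vary per installation)
include "constants.conf"
include "zones.conf"
include <itl>              // the Icinga Template Library
include <plugins>          // standard plugin CheckCommand definitions
include_recursive "conf.d" // all *.conf files under conf.d, recursively
```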

Hierarchy

The hierarchy in Icinga2 consists of objects of type zone. Zones depend on the parent-child relationship to be able to trust each other. A server from a parent zone can pass its configuration information to its child servers but hosts lower in the hierarchy cannot pass configuration data to the parent zone. This is valid for the top-down configuration setup, which is most common.

Agent servers also have their own zone. As a standard practice, we use the FQDN (Fully Qualified Domain Name) of the agent to name this zone.
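Such an agent zone is usually defined alongside its Endpoint object, named after the FQDN, with the satellite zone as its parent. A sketch with placeholder names:

```
object Endpoint "agent1.site1.example.com" {
  host = "10.1.1.1" // optional; omit if the agent connects out to the satellite
}

object Zone "agent1.site1.example.com" {
  endpoints = [ "agent1.site1.example.com" ]
  parent = "site1" // the satellite zone this agent reports to
}
```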

Objects, templates, and apply rules

[Screenshot: the Icinga Web 2 Overview section – hosts, tactical overview, services, host groups, service groups, contact groups, downtimes, comments]

In the configuration files, we describe all the elements that we see in the Overview section, as well as other templates, and commands, and apply rules and configurations that are needed to have a good description and structure in Icinga.

Every check we perform on a server we monitor is nothing more than a script executed on it. It is a good idea to keep the absolute path to all default Nagios plugins, along with any scripts that we additionally download or write, in a constant. The default constants are defined in the constants.conf file in the main Icinga directory (usually /etc/icinga2). In the case of a distributed setup like the one we’re describing here, it’s best to keep the constants in a global zone, e.g. global-templates, to ensure you don’t need to define them in the constants.conf of every single satellite or agent.

object CheckCommand "SYSTEM.cpu" {
  timeout = 60
  command = CustomPluginDir + "/check_cpu_stats.sh -w $cpu_warn$ -c $cpu_crit$"
  vars.cpu_warn = "20"
  vars.cpu_crit = "30"
}

To use these scripts, we need to define them as objects of type CheckCommand.

The main thing about the CheckCommand object definition is that we specify the path to the script we want to execute and the parameters we would like to pass to it. In this case, CustomPluginDir is a constant, which is concatenated with the remaining path to the script, along with the parameters that can be passed to it. The parameters are best kept as variables that can later be overridden in the service definition. We can then create an apply Service rule to specify exactly which CheckCommand object to attach to which hosts. For example:

apply Service "SYSTEM.cpuStats." for (cpu_load => config in host.vars.cpu_stats) {
  import "generic-service"
  check_command = "SYSTEM.cpu"
  command_endpoint = host.vars.client_endpoint
  max_check_attempts = 10
  vars += config

  assign where host.vars.client_endpoint && host.vars.cpu_stats
}

(The += operator adds the contents of the dictionary host variable vars.cpu_stats as variables for the service. It overrides any default variables set in the CheckCommand definition, or those written before it in the Service definition.)

Via the assign where clause we describe which hosts to apply this service to. In our example, it applies to those hosts that have the vars.client_endpoint and vars.cpu_stats variables defined, no matter what their value is. We can also apply the service only to hosts that have a specific value of a given variable, for example, assign where host.vars.os == "Linux".

If we don’t have an assign where <condition> line in the apply rule, Icinga will give a Critical error that the corresponding service is not connected to any host. If we have defined an assign rule but it doesn’t match any host, Icinga will throw a Warning: the corresponding service will exist in the configuration, but will not be attached to any instance.

template Service "generic-service" {
  max_check_attempts = 5
  check_interval = 2m
  retry_interval = 1m
}

Icinga allows us to create service templates. In them, we indicate the maximum number of times a check should be performed before the service status changes to a hard warning, critical, or unknown state and a notification is sent. In the template we also set check_interval, i.e. at what interval to execute the given check, as well as retry_interval, i.e. how long to wait before executing a check again when it is not successful. Template values can always be overridden in each service individually, after the line where the template is imported. Templates can also contain vars or other metadata, depending on the structure of the configuration.

Of course, we must also have a defined host to which to apply a given service.

object Host "hostname1" {
  import "generic-host"
  address = "10.1.1.1"
  vars.client_endpoint = "hostname1"
  //...
  vars.cpu_stats.dft = {
    cpu_warn = "50"
    cpu_crit = "70"
  }
}

In Icinga, we have the ability to create host templates as well. In these, we can again set the maximum number of checks to be made before the host state transitions to a hard state, with optional EventCommands or notifications triggered along the way. The template also contains information about how often to run the check, and how often to rerun it if it fails. Furthermore, in the Host template we can define the CheckCommand as well.

template Host "generic-host" {
  max_check_attempts = 3
  check_interval = 1m
  retry_interval = 30s
  check_command = "custom_health"
}

object CheckCommand "custom_health" {
    command = CustomPluginDir + "/hostalive.py -H $address$ -n $host.display_name$ -s -e $extra_port_check$"
}

In our example, in the object Host "hostname1" we import the template "generic-host", which executes the command "custom_health".

In a distributed setup like this, all configuration is defined on the master server and is distributed to the servers below (satellites and agents) via the API connectivity that is established during server registration. All servers need to know the command and service definitions in order to execute a script, and that’s what the global-templates zone is used for: it is distributed to all servers in the monitoring network. The rest of the zones are usually satellite zones and are distributed accordingly. For example, the satellite zone site1 is transferred to the satellite instances in the site1 zone so that they know which agents are present in their zone.
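A global zone is marked with the global attribute in zones.conf, which tells the master to replicate that zone's configuration to every connected instance:

```
object Zone "global-templates" {
  // No endpoints: this zone's configuration is synced to all instances.
  global = true
}
```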

Notifications

In order to manage our infrastructure effectively and respond to critical alarms quickly, we need to be able to be notified when a problem occurs. For this purpose, in Icinga, we can configure the NotificationCommand object, in which we specify the command itself.

object NotificationCommand "service-notification-itop-p1" {
  command = [ CustomPluginDir + "/icinga2-cli/icinga2_cli.py", "notify" ]
  arguments = {
    "-m" = "itop"
    "-f" = NotifPluginDir + "/notify-itop-service.txt"
  }
  env = {
    SERVICENAME = "$service.name$"
    //...
  }
}

We can also define a Notification template, in which we specify which NotificationCommand to use, which states and types of notifications to execute it for, and during which time period.

template Notification "service-notification-itop_24x7_p1" {
  command = "service-notification-itop-p1"
  states = [ OK, Warning, Critical, Unknown ]
  types = [ Problem, Custom ]
  vars += {
    notification_logtosyslog = true
  }
  period = "24x7"
}

object TimePeriod "24x7" {
  display_name = "Icinga 2 24x7 TimePeriod"
  ranges = {
    "monday"    = "00:00-24:00"
    "tuesday"   = "00:00-24:00"
    "wednesday" = "00:00-24:00"
    "thursday"  = "00:00-24:00"
    "friday"    = "00:00-24:00"
    "saturday"  = "00:00-24:00"
    "sunday"    = "00:00-24:00"
  }
}

In order for the notification to apply to a Service or Host, we need an apply rule, just like we do for a service. Here we have the option to set which users or groups to notify, usually taken from the host config. We can set an interval for re-notification if the problem persists and, of course, the familiar assign where statement, which we also know from the Service definition.

apply Notification "icingaadmin-itop_24x7_p1" to Service {
  import "service-notification-itop_24x7_p1"
  user_groups = host.vars.notification.api_itop_24x7_p1.groups
  users = host.vars.notification.api_itop_24x7_p1.users
  interval = 30m
  vars.notification_logtosyslog = true

  assign where service.vars.eoc_prio == "p1" &&  host.vars.notification.api_itop_24x7_p1
}

In it, we also specify which users the notification should be sent to. In order to achieve this, we must have a User or UserGroup object, in which we define the users and specify an email address or other means of notification delivery.

object Host "some-host"{
  vars.notification["api_itop_24x7_p1"] = {
    users = [ "clientname-p1" ]
  }
  //...
}

From here we see that in the above apply Notification rule the expression host.vars.notification.api_itop_24x7_p1.users has a value of clientname-p1. However, on the specific host, we do not have a value for vars.notification.api_itop_24x7_p1.groups, so a notification will only be sent to the specified user.

object User "clientname-p1" {
  import "generic-user"
  display_name = "ClientName HighPriority User"
  email = "clientname_p1@companyname.com"
  //...
}

The most common thing a User object can have in its definition is the email to which the notification should be sent. We can also have a User template with additional features to import into the User object “clientname-p1”. User objects can also have vars, for more complex delivery processes.

template User "generic-user" {
  //...
}

We can write all these configurations in a single file with the .conf extension, located in the main Icinga directory. For convenience, we can also separate them into individual configuration files in different directories.
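On the master of a distributed setup, a common way to split the configuration is by zone under the zones.d directory, which is what the config sync distributes. A sketch of one possible layout; the file names are illustrative, only the zone directory names follow our example:

```
/etc/icinga2/zones.d/
├── global-templates/
│   ├── commands.conf      # CheckCommand and NotificationCommand objects
│   ├── templates.conf     # generic-host, generic-service, generic-user
│   └── timeperiods.conf   # the 24x7 TimePeriod
└── site1/
    ├── hosts.conf         # Host objects for the agents in site1
    └── services.conf      # apply Service and apply Notification rules
```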

Conclusion

Icinga is open-source monitoring software that provides many configuration mechanisms, so that we can organize the monitoring of an environment as efficiently as possible. Thanks to these mechanisms we can make our own decisions about which are the most critical metrics to monitor, when to send notifications, at what interval, etc. Icinga can be used for all types of cloud environments (public, private, and hybrid). Even though Icinga2 lacks out-of-the-box support for modern solutions like Kubernetes, open-source projects like Signalilo strive to bridge the gap between more container-oriented solutions like Prometheus and Icinga2.

Author:

Ralitsa Dimitrova, SRE Team

For more resources and expert tutorials on Icinga, check out “Icinga 2 Fine Tuning”, as well as “Icinga 2: API and Passive Checks”.
