Monitoring Openstack Part 2

In my previous blog post I discussed how to monitor RabbitMQ as the centralized message queue of OpenStack. That is quite important, but
the end goal of having a cloud is the instances running on top of the machines. Most of you, and especially the
infrastructure guys who dig into monitoring, will know which components are the most important to look over.
The reason to monitor is to have reasonable capacity planning, which is probably the drill in a cloud environment where you spawn
a large number of virtual machines or containers. On the other hand, having the data at a glance makes it much easier to
increase reliability and uptime, plan your architecture better and identify the bottlenecks of your setup.

My choice for cloud monitoring and devops, as I mentioned in the previous post, is Icinga2. I'll try to explain in depth
why and how it impressed me, but let's focus on monitoring virtual machines and KVM instances. First of all, I'll start
with what is most important to monitor and define the main characteristics. Good preparation gives the best results!

Host Monitoring


We have two perspectives to look from. The first one, and probably the more interesting, is looking from the host operating system.
From this angle we see each KVM machine as a single process running on the host. Something like:

qemu     32381     1  9 Jul14 ?        14:41:20 /usr/libexec/qemu-kvm -name instance-000000c9 -S -machine pc-i440fx-rhel7.0.0


This qemu process represents the virtual machine, and its parameters define it. That helps, doesn't it? We need to monitor
a process, and from here we can build the stats to monitor:


- VSZ - virtual memory size of process
- RSS - resident set size of process
- percentage of CPU
- process existence (instance name would be the best filter for that)


The Nagios plugin that facilitates the job is check_procs.
The last one is not so easy, because you have to identify which machine corresponds to which instance ID. Usually you can match this in
the libvirt.xml inside the nova instances directory and get your instance name.
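
To make the per-process stats above concrete, here is a minimal Python sketch that pulls VSZ, RSS and CPU percentage for every qemu process and keys them by instance name. It is only an illustration, not a replacement for check_procs: the function names are made up for this post, and it assumes the instance name follows the -name argument (either directly or as guest=...).

```python
import re
import subprocess

# Matches both "-name instance-000000c9" and the
# "-name guest=instance-000000c9,debug-threads=on" form.
NAME_RE = re.compile(r"-name\s+(?:guest=)?([^\s,]+)")


def parse_qemu_ps(ps_output):
    """Parse `ps -eo pid,vsz,rss,pcpu,args` output into per-instance stats."""
    instances = {}
    for line in ps_output.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header or malformed line
        pid, vsz, rss, pcpu, args = fields
        if "qemu" not in args.split(None, 1)[0]:
            continue  # not a qemu process
        match = NAME_RE.search(args)
        if not match:
            continue  # qemu process without a -name argument
        instances[match.group(1)] = {
            "pid": int(pid),
            "vsz_kb": int(vsz),
            "rss_kb": int(rss),
            "cpu_percent": float(pcpu),
        }
    return instances


def collect():
    """Run ps on the hypervisor and return the stats of all qemu instances."""
    out = subprocess.run(
        ["ps", "-eo", "pid,vsz,rss,pcpu,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_qemu_ps(out)
```

Calling collect() on a compute node returns a dictionary keyed by instance name, so a missing key is the "process existence" check from the list above.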
Another useful tool for checking the state of the machines is virsh. With virsh list you can get the state of each machine and look for crashed ones:


 Id    Name                           State
----------------------------------------------------
 39    instance-000000c8              running
 40    instance-000000c9              running
 41    instance-000000bb              running


Usually the states of KVM machines are:


running - the instance is running and operational
paused - the instance is paused (in OpenStack terminology, suspended)
inactive - stopped or shut off
crashed - an error occurred when the instance started


It is best to search for crashed machines and then look for the reason why they crashed:
virsh list --all | grep crashed
Another shortcut is to use the check_libvirt Nagios plugin, which does a lot of this for you.
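
If you prefer to do the same filtering in a script, the virsh table shown earlier is easy to parse. The sketch below is an assumption-laden illustration (the function names are invented for this post); it only relies on the three-column Id / Name / State layout and on inactive domains being listed with "-" as Id.

```python
def parse_virsh_list(output):
    """Map domain name -> state from `virsh list --all` output."""
    states = {}
    for line in output.splitlines():
        parts = line.split(None, 2)
        # Data rows start with a numeric Id, or "-" for inactive domains.
        if len(parts) < 3 or not (parts[0].isdigit() or parts[0] == "-"):
            continue  # header, separator or empty line
        states[parts[1]] = parts[2].strip()
    return states


def crashed_instances(output):
    """Return the sorted names of all domains reported as crashed."""
    states = parse_virsh_list(output)
    return sorted(name for name, state in states.items() if state == "crashed")
```

A check built on this could go critical whenever crashed_instances() returns a non-empty list.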
The other aspect of monitoring is, of course, looking from inside the guest. Some parameters may be
correlated with the host-side monitoring above, but there are still important differences:


- memory utilization of the server
- cpu load and waits
- disk I/O
- disk usage


Memory


Let's take them one by one. The first, quite important and not so easy to track, is memory utilization:


              total        used        free      shared  buff/cache   available
Mem:       74053676    19262360      377524     3720412    54413792    50583164
Swap:        978940      107552      871388


Most of the infrastructure guys, or at least the better ones, know the “free” command. But what does the output mean? There are lots of stats, and free memory is really small relative to the others. The reason is that Linux uses otherwise idle memory for buffers and cache to save lots of disk I/O operations. So we count free and cached memory together as a common metric. A nice plugin for that is:
https://github.com/justintime/nagios-plugins
It works really well and returns performance data for nice graphs.
Example:

check_mem.pl -f -C -w 20 -c 10
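
To show the arithmetic behind "free plus cached", here is a small Python sketch that parses the free output above and applies the same 20/10 thresholds. It is not how check_mem.pl is implemented; the function names are hypothetical and the thresholds are assumed to mean "percent of memory still usable".

```python
def parse_free(output):
    """Return the Mem: row of `free` output as a dict keyed by column name."""
    lines = output.strip().splitlines()
    columns = lines[0].split()  # total, used, free, shared, buff/cache, available
    for line in lines[1:]:
        if line.startswith("Mem:"):
            values = [int(value) for value in line.split()[1:]]
            return dict(zip(columns, values))
    raise ValueError("no Mem: row found")


def usable_percent(mem):
    """Percentage of memory that is free or used only for buffers and cache."""
    return 100.0 * (mem["free"] + mem["buff/cache"]) / mem["total"]


def memory_status(percent, warn=20.0, crit=10.0):
    """Assumed semantics: alert when usable memory drops below the threshold."""
    if percent < crit:
        return "CRITICAL"
    if percent < warn:
        return "WARNING"
    return "OK"
```

For the output shown above, free alone is about 0.5% of total, while free plus buff/cache is roughly 74%, which is why counting the cache matters. On newer systems the "available" column is the kernel's own estimate of the same idea.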


CPU Load and Utilization

Another really good metric to have in your dashboard and alerting is CPU load. Load can be high for lots of reasons, such as iowait or a process exhausting the CPU.
You can use the check_load plugin that comes by default with the Nagios plugins.
If you would like to monitor iowait from the CPU stats, you can use the check cpu stats module from Nagios Exchange. Make sure sysstat (which provides iostat) is installed on the target server. Here is an example:

check_cpu_stats.sh -w 20 -c 30
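
For the curious, iowait can also be derived without iostat by comparing two samples of the aggregate "cpu" line in /proc/stat. The following Python sketch shows the calculation; it is a simplified illustration with made-up function names, not the logic of check_cpu_stats.sh.

```python
def cpu_times(stat_line):
    """Return the first eight counters of the aggregate "cpu" line.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    """
    parts = stat_line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in parts[1:9]]


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0  # no time elapsed between the samples
    return 100.0 * deltas[4] / total


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate cpu line (first line of /proc/stat on Linux)."""
    with open(path) as stat_file:
        return stat_file.readline()
```

Take one sample with read_cpu_line(), sleep a few seconds, take a second one and pass both to iowait_percent().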

Disk I/O


Here you can use the metrics to find issues in the underlying storage. Together with performance data you gain even more statistics about how the storage performs.

check_io -d sda -w 40,400,400 -c 100,700,700
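
As a rough idea of where such numbers come from, the sketch below computes transfers per second and read/write throughput for one device from two samples of /proc/diskstats. Mapping the three threshold values to tps, read KB/s and write KB/s is my assumption for this example; the function names are invented and this is not the implementation of check_io.

```python
SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors


def disk_counters(diskstats_text, device):
    """Extract the I/O counters of one device from /proc/diskstats content."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 14 and fields[2] == device:
            return {
                "reads": int(fields[3]),
                "sectors_read": int(fields[5]),
                "writes": int(fields[7]),
                "sectors_written": int(fields[9]),
            }
    raise KeyError(device)


def io_rates(before, after, interval_seconds):
    """Compute per-second rates from two counter samples."""
    delta = {key: after[key] - before[key] for key in before}
    return {
        "tps": (delta["reads"] + delta["writes"]) / interval_seconds,
        "read_kb_s": delta["sectors_read"] * SECTOR_BYTES / 1024 / interval_seconds,
        "write_kb_s": delta["sectors_written"] * SECTOR_BYTES / 1024 / interval_seconds,
    }
```

Read /proc/diskstats twice with a known interval in between, pass each sample through disk_counters(text, "sda") and compare the resulting rates with your warning and critical values.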

Disk Usage


Last but not least, you can check the usage of all disks with check_disk.
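
The idea behind a disk usage check fits in a few lines of Python. This is only a sketch with hypothetical function names and assumed thresholds (warn below 20% free, critical below 10% free); it uses shutil.disk_usage from the standard library rather than anything from check_disk.

```python
import shutil


def disk_free_percent(total_bytes, free_bytes):
    """Percentage of the filesystem that is still free."""
    return 100.0 * free_bytes / total_bytes


def classify(percent_free, warn=20.0, crit=10.0):
    """Assumed semantics: alert when free space drops below the threshold."""
    if percent_free < crit:
        return "CRITICAL"
    if percent_free < warn:
        return "WARNING"
    return "OK"


def disk_status(path, warn=20.0, crit=10.0):
    """Return (state, percent_free) for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    percent_free = disk_free_percent(usage.total, usage.free)
    return classify(percent_free, warn, crit), percent_free
```

Calling disk_status("/") inside the guest gives the state of the root filesystem; loop over your mount points to cover all disks.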


Well, here comes the end of this monitoring post. If you are looking for any help with monitoring, feel free to contact us via the contact form.