In my previous blog post, I discussed how to monitor RabbitMQ as the centralized message queue of OpenStack. That is quite important, but the end goal of having a cloud is the instances running on top of the machines. Most of you, and especially the infrastructure people who dig into monitoring, will know which components are the most important to look after.
The reason to monitor is to plan sensibly, which is routine in a cloud environment where you have to spawn a large number of virtual machines or containers. Having the data available at a glance also makes it much easier to increase reliability and uptime, plan your architecture better, and identify the bottlenecks of your setup.
My choice for cloud monitoring and DevOps, as I mentioned in the previous post, is Icinga2. I’ll try to explain in depth why and how it impressed me, but let’s focus here on monitoring virtual machines and KVM instances. First of all, I’ll start with what is most important to monitor and define its main characteristics. Good preparation gives the best results!
There are two perspectives we can look from. The first, and probably the more interesting one, is from the host operating system. From this angle we see the KVM machine as a single process running on the host. Something like:
qemu 32381 1 9 Jul14 ? 14:41:20 /usr/libexec/qemu-kvm -name instance-000000c9 -S -machine pc-i440fx-rhel7.0.0
This qemu process represents the virtual machine, and its parameters define it. That helps, doesn’t it? We only need to monitor a process, and from here we can build the stats to watch:
- VSZ - virtual memory size of the process
- RSS - resident set size of the process
- percentage of CPU
- process existence (the instance name is the best filter for that)
The nagios plugin that facilitates the job is check_procs.
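To make the list above concrete, here is a minimal sketch that pulls VSZ, RSS and CPU percentage for one instance straight out of ps output. The function name qemu_stats is my own, and it assumes the instance name follows -name as a separate argument, as in the process line shown earlier; newer qemu versions may format that argument differently. It is not a replacement for check_procs, just an illustration of where the numbers come from.

```shell
# Hypothetical helper (not part of any plugin): reads the output of
#   ps -eo pid,vsz,rss,pcpu,args
# on stdin and prints "pid vsz_kb rss_kb pcpu" for the qemu process
# whose "-name" argument equals the given instance name.
qemu_stats() {
  instance="$1"
  awk -v name="$instance" '
    {
      # Fields 1-4 are pid, vsz, rss, pcpu; the command line starts at field 5.
      for (i = 5; i < NF; i++) {
        if ($i == "-name" && $(i + 1) == name) {
          print $1, $2, $3, $4
        }
      }
    }'
}

# Usage on a hypervisor (assumed invocation, not executed here):
#   ps -eo pid,vsz,rss,pcpu,args | qemu_stats instance-000000c9
```

An empty result means no matching process was found, which covers the process existence check as well.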
Process existence is the trickiest of these, because it is not so easy to identify which machine corresponds to which instance ID. Usually you can match this in libvirt.conf inside the nova instances directory and get your instance name.
Another useful tool for checking the state of a machine is virsh. Its list command shows the state of each machine, so you can look for crashed ones:
 Id    Name                           State
----------------------------------------------------
 39    instance-000000c8              running
 40    instance-000000c9              running
 41    instance-000000bb              running
Usually the states of kvm machines are:
- running - the instance is running and operational
- paused - called suspended in OpenStack terminology
- inactive - stopped or shut off
- crashed - an error occurred when the instance started
The best approach is to search for crashed machines first and then look into the reason why they crashed.
virsh list | grep crashed
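If you want the grep above to behave like a monitoring check, a small wrapper can turn it into Nagios-style output and exit codes (0 for OK, 2 for CRITICAL). This is only a sketch: the function name check_crashed is hypothetical, and it assumes the three-column virsh list layout shown earlier with a two-line header.

```shell
# Hypothetical wrapper: reads `virsh list` style output on stdin and
# reports crashed domains using Nagios exit codes (0 = OK, 2 = CRITICAL).
check_crashed() {
  # Skip the two header lines; the name is column 2, the state is the last column.
  crashed=$(awk 'NR > 2 && $NF == "crashed" { printf "%s%s", sep, $2; sep = "," }')
  if [ -n "$crashed" ]; then
    echo "CRITICAL: crashed domains: $crashed"
    return 2
  fi
  echo "OK: no crashed domains"
  return 0
}

# Usage on a hypervisor (assumed invocation, not executed here):
#   virsh list --all | check_crashed
```

Passing --all makes virsh list include machines that are not running as well, which may or may not be what you want in your alerting.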
Another shortcut is to use the check_libvirt nagios plugin which does a lot.
The other aspect of monitoring is, of course, looking from inside the guest. Some parameters may correlate with the host-side monitoring above, but there are still major differences in what you watch:
- memory utilization of the server
- CPU load and waits
- disk I/O
- disk usage
Let’s take them one by one. The first, quite important and not so easy to track, is memory utilization.
              total        used        free      shared  buff/cache   available
Mem:       74053676    19262360      377524     3720412    54413792    50583164
Swap:        978940      107552      871388
Most infrastructure people, or at least the better ones, know the “free” command. So what does the output mean? There are lots of stats, and free memory is really small relative to the others. The reason is that Linux uses otherwise idle memory as cache to save lots of disk I/O operations. So we count free and cached memory together as a common metric. A nice plugin for that is check_mem.pl.
It works really well and returns performance data for nice graphs.
check_mem.pl -f -C -w 20 -c 10
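To show the arithmetic behind "free plus cached", here is a rough sketch that computes that percentage from free output and applies the same kind of warning and critical thresholds. The function name check_mem_free is my own, it assumes the column layout shown above (free in column 4, buff/cache in column 6), and it is not how check_mem.pl is implemented internally.

```shell
# Hypothetical check: reads `free` output on stdin and alerts when
# (free + buff/cache) as a percentage of total drops below the thresholds.
# $1 = warning threshold in percent, $2 = critical threshold in percent.
check_mem_free() {
  awk -v warn="$1" -v crit="$2" '
    $1 == "Mem:" {
      # Columns: Mem: total used free shared buff/cache available
      pct = ($4 + $6) * 100 / $2
      if (pct < crit)      { status = "CRITICAL"; rc = 2 }
      else if (pct < warn) { status = "WARNING";  rc = 1 }
      else                 { status = "OK";       rc = 0 }
      printf "%s - %.1f%% free (incl. cache)\n", status, pct
      found = 1
    }
    END {
      if (!found) { print "UNKNOWN - no Mem: line found"; exit 3 }
      exit rc
    }'
}

# Usage inside the guest (assumed invocation, not executed here):
#   free | check_mem_free 20 10
```

Keep in mind that buff/cache also includes shared memory that cannot simply be dropped, so the "available" column is usually a more conservative number than this sum.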
CPU LOAD AND UTILIZATION
Another really good metric to have in your dashboard and alerting is CPU load. Load can be high for lots of reasons, such as iowait or processes exhausting the CPU.
You can use the check_load plugin that comes by default with the Nagios plugins.
If you would like to monitor iowait from the CPU stats, you can use the check cpu stats module from Nagios Exchange. Make sure sysstat (the package that provides iostat) is installed on the target server. Here is an example:
check_cpu_stats.sh -w 20 -c 30
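If you are curious where an iowait percentage comes from, the sketch below derives it from two samples of the aggregate cpu line in /proc/stat. This is an alternative illustration, not how check_cpu_stats.sh works (that plugin relies on iostat), and the function name iowait_pct is hypothetical.

```shell
# Hypothetical helper: takes two "cpu ..." lines from /proc/stat, sampled a
# few seconds apart, and prints the iowait share of elapsed CPU time in percent.
iowait_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      # Fields 2-9: user nice system idle iowait irq softirq steal
      total[NR] = 0
      for (i = 2; i <= 9; i++) total[NR] += $i
      iow[NR] = $6
    }
    END {
      dt = total[2] - total[1]
      if (dt <= 0) { print "0.0"; exit }
      printf "%.1f\n", (iow[2] - iow[1]) * 100 / dt
    }'
}

# Usage inside the guest (assumed invocation, not executed here):
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```

Because the counters are cumulative since boot, only the difference between two samples tells you what is happening right now.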
Disk I/O is the metric to use to find any issues in the underlying storage. Together with the performance data, you will gain even more statistics about how the storage performs.
check_io -d sda -w 40,400,400 -c 100,700,700
Last but not least, you can check the usage of all disks with check_disk.
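For a quick look at what such a usage check boils down to, here is a small sketch that scans df -P output and flags filesystems above a warning or critical percentage. The function name disk_usage_report is my own, and it assumes mount points without spaces; check_disk itself handles far more cases.

```shell
# Hypothetical report: reads `df -P` output on stdin and prints one line per
# filesystem whose usage is at or above the thresholds.
# $1 = warning percent, $2 = critical percent. Exit code follows Nagios:
# 0 = all fine, 1 = at least one warning, 2 = at least one critical.
disk_usage_report() {
  awk -v warn="$1" -v crit="$2" '
    NR > 1 {
      # Column 5 is the capacity such as "85%", column 6 is the mount point.
      use = $5
      sub(/%/, "", use)
      use += 0
      if (use >= crit)      { print "CRITICAL", $6, use "%"; if (rc < 2) rc = 2 }
      else if (use >= warn) { print "WARNING", $6, use "%";  if (rc < 1) rc = 1 }
    }
    END { exit rc + 0 }'
}

# Usage inside the guest (assumed invocation, not executed here):
#   df -P | disk_usage_report 80 90
```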