
Monitoring OpenStack Part 2
Host Monitoring
qemu 32381 1 9 Jul14 ? 14:41:20 /usr/libexec/qemu-kvm -name instance-000000c9 -S -machine pc-i440fx-rhel7.0.0
This qemu process represents the virtual machine, and its parameters define it. That helps, doesn't it? We need to monitor
this process, and from it we can build the stats to watch:
- VSZ - virtual memory size of process
- RSS - resident set size of process
- percentage of CPU
- process existence (instance name would be the best filter for that)
The Nagios plugin that does this job is check_procs.
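To make the list above concrete, here is a minimal sketch of pulling those numbers out of ps output for a single instance. It is not how check_procs works internally. The helper name qemu_proc_stats is made up for illustration, and it assumes the instance name follows -name as a separate argument, as in the process line above.

```shell
# Sketch: extract VSZ, RSS and %CPU for one instance from ps output.
# Expects lines in the format printed by: ps -eo pid,vsz,rss,pcpu,args
# qemu_proc_stats is a hypothetical helper, not part of any Nagios plugin.
qemu_proc_stats() {
    instance="$1"
    awk -v inst="$instance" '
        # Only look at qemu-kvm processes; field 5 is the command path.
        $5 ~ /qemu-kvm$/ {
            for (i = 6; i < NF; i++)
                if ($i == "-name" && $(i + 1) == inst) {
                    printf "pid=%s vsz_kb=%s rss_kb=%s cpu_pct=%s\n", $1, $2, $3, $4
                    found = 1
                }
        }
        # Exit 2 (CRITICAL in Nagios terms) when no matching process exists.
        END { exit (found ? 0 : 2) }
    '
}

# Usage on a compute host:
#   ps -eo pid,vsz,rss,pcpu,args | qemu_proc_stats instance-000000c9
```

With check_procs itself, the existence check would look something like check_procs -c 1:1 -C qemu-kvm -a instance-000000c9, where the threshold is only an example.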
For the last one, it is not so easy to identify which machine corresponds to which instance ID. Usually you can match this in
the libvirt.xml file inside the Nova instances directory and get your instance name.
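As a rough sketch of that lookup: on releases where Nova writes its metadata into the libvirt domain XML (an assumption about your deployment), the display name sits in a nova:name element and can be pulled out of the XML. The helper name nova_display_name is made up for illustration.

```shell
# Sketch: print the Nova display name found in a libvirt domain XML.
# Reads the XML on stdin, e.g. from: virsh dumpxml instance-000000c9
# Assumes the XML carries a <nova:name> element; older releases may not.
nova_display_name() {
    sed -n 's|.*<nova:name>\(.*\)</nova:name>.*|\1|p' | head -n 1
}

# Usage on a compute host:
#   virsh dumpxml instance-000000c9 | nova_display_name
```

If nothing is printed, the XML has no such element and you have to fall back to matching the instance UUID by hand.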
Another useful tool for checking the state of a machine is virsh. Its list command shows the state of each machine, so you can look for crashed ones:
 Id    Name                 State
----------------------------------------------------
 39    instance-000000c8    running
 40    instance-000000c9    running
 41    instance-000000bb    running
The usual states of KVM machines are:
running - the instance is running and operational
paused - called suspended in OpenStack terminology
inactive - stopped or shut off
crashed - an error occurred when the instance started
It is best to search for crashed machines first and then look for the reason they crashed.
virsh list | grep crashed
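The grep above can be turned into something countable. Here is a minimal sketch that counts the domains in a given state from the virsh table. The helper name count_domains_in_state is made up, and it assumes the three-column Id, Name, State layout shown earlier.

```shell
# Sketch: count domains in a given state from virsh list output on stdin.
# count_domains_in_state is a hypothetical helper name.
count_domains_in_state() {
    state="$1"
    awk -v want="$state" '
        # Skip the header and separator lines, and any blank lines.
        NR > 2 && NF >= 3 {
            # The state is everything after Id and Name ("shut off" is two words).
            s = $3
            for (i = 4; i <= NF; i++) s = s " " $i
            if (s == want) n++
        }
        END { print n + 0 }
    '
}

# Usage on a compute host (--all also lists domains that are not running):
#   virsh list --all | count_domains_in_state crashed
```

Anything above zero for crashed is worth an alert.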
Another shortcut is the check_libvirt Nagios plugin, which does a lot of this work for you.
The other aspect of monitoring is, of course, looking at the guest itself. Some parameters correlate with the
host-side monitoring above, but there are important differences. The main things to watch are:
- memory utilization of the server
- CPU load and I/O waits
- disk I/O
- disk usage
Memory
Let's take them one by one. The first, quite important and not so easy to track, is memory utilization:
              total        used        free      shared  buff/cache   available
Mem:       74053676    19262360      377524     3720412    54413792    50583164
Swap:        978940      107552      871388
Most infrastructure people, or at least the better ones, know the “free” command. So what does it show? Lots of stats, and free memory is really small relative to the others. The reason is that Linux uses otherwise idle memory as buffers and page cache to save a lot of disk I/O operations, and it can release that memory again when applications need it. So we count free and cached memory together as one metric. A nice plugin for that is:
https://github.com/justintime/nagios-plugins
It works really well and returns performance data for nice graphs.
Example:
check_mem.pl -f -C -w 20 -c 10
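To show the arithmetic behind that metric, here is a minimal sketch that adds free and buff/cache and expresses the sum as a percentage of total. It is not the plugin's own code. Both helper names are made up, and the column positions assume the newer free layout with a single buff/cache column, as in the output above.

```shell
# Sketch: percentage of memory that is free or used as buffers/cache.
# Reads the output of free on stdin; assumes the layout
# "Mem: total used free shared buff/cache available".
mem_free_plus_cache_pct() {
    awk '/^Mem:/ { printf "%d\n", ($4 + $6) * 100 / $2 }'
}

# Map a percentage to a Nagios-style status: alert when the value drops
# below the thresholds. $1 = value, $2 = warning, $3 = critical.
nagios_status_low() {
    if [ "$1" -lt "$3" ]; then
        echo CRITICAL; return 2
    elif [ "$1" -lt "$2" ]; then
        echo WARNING; return 1
    else
        echo OK; return 0
    fi
}

# Usage:
#   pct=$(free | mem_free_plus_cache_pct)
#   nagios_status_low "$pct" 20 10
```

With the numbers shown above, (377524 + 54413792) / 74053676 works out to about 73 percent, which is well above the 20 and 10 percent thresholds in the example.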
CPU Load and Utilization
Another really good metric to have in your dashboards and alerting is CPU load. Load can be high for lots of reasons, such as iowait or processes exhausting the CPU.
You can use the check_load plugin that ships by default with the Nagios plugins.
If you would like to monitor iowait from the CPU stats, you can use the check_cpu_stats plugin from Nagios Exchange. Make sure sysstat (which provides iostat) is installed on the target
server. Here is an example:
check_cpu_stats.sh -w 20 -c 30
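If you want to see where an iowait percentage comes from, here is a minimal sketch that computes it from two samples of the first line of /proc/stat. This is a different technique from the plugin, which relies on iostat. The helper name iowait_pct is made up for illustration.

```shell
# Sketch: iowait percentage between two samples of the "cpu" line in /proc/stat.
# $1 and $2 are the two lines, taken a few seconds apart.
# Fields after "cpu": user nice system idle iowait irq softirq steal ...
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Total CPU time is the sum of user..steal (fields 2 to 9).
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            if (NR == 1) { t0 = total; w0 = $6 }
            else         { t1 = total; w1 = $6 }
        }
        END {
            if (t1 > t0) printf "%d\n", (w1 - w0) * 100 / (t1 - t0)
            else print 0
        }
    '
}

# Usage:
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   iowait_pct "$a" "$b"
```

The result is the share of CPU time spent waiting on I/O during the interval, which you can compare against warning and critical thresholds such as the 20 and 30 used above.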
Disk I/O
Here you can use the metric to find issues in the underlying storage. Together with the performance data, you gain even more statistics about how the storage performs.
check_io -d sda -w 40,400,400 -c 100,700,700
Disk Usage
Last but not least, you can verify the usage of all disks with check_disk.
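As a rough illustration of what such a check boils down to, here is a minimal sketch that lists filesystems at or above a usage threshold from df output. It is not how check_disk is implemented. The helper name disks_over_threshold is made up, and it assumes mount points without spaces.

```shell
# Sketch: list mount points whose usage is at or above a percentage limit.
# Reads the output of df -P on stdin; $1 is the limit in percent.
# disks_over_threshold is a hypothetical helper name.
disks_over_threshold() {
    limit="$1"
    awk -v limit="$limit" '
        # Skip the header; column 5 is Capacity ("90%"), column 6 the mount point.
        NR > 1 {
            use = $5
            sub(/%$/, "", use)
            if (use + 0 >= limit + 0) print $6 " " use "%"
        }
    '
}

# Usage:
#   df -P | disks_over_threshold 80
```

An empty result means every filesystem is below the limit.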