Monitor EC2 Machine with Prometheus in Amazon EKS

Yoan Spasov
DevOps & Cloud Engineer
07.12.2022
Reading time: 4 mins.
Last Updated: 10.06.2024

What are Operators in Kubernetes?

Operators are software extensions to the Kubernetes API server which provide additional functionality. They automate the whole lifecycle of the software they control: they allow us to package, deploy, and manage a Kubernetes application. How do they do that? They define CRs (Custom Resources) to run an application and all of its components. CRs allow IT admins to introduce unique objects and types into a Kubernetes cluster to meet specific, custom requirements that the standard, built-in API objects and resources do not cover. A CRD (Custom Resource Definition) defines a CR and lists all of the configuration options available to the users of the operator. To make sure that the desired and current state of the software match, an operator monitors a Custom Resource and takes action whenever there is a mismatch between “what we have” and “what we want to have”. Some examples of operators include the Istio Operator, the Postgres Operator, and the Prometheus Operator.
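As a toy illustration, a CRD introducing a hypothetical “Backup” resource could look like the sketch below (the group, names, and schema are invented for the example):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # The name must be <plural>.<group>.
  name: backups.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: backups
    singular: backup
    kind: Backup
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              schedule:
                type: string

An operator watching Backup objects would then reconcile the cluster until the actual state matches what each Backup’s spec declares.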

What does the Prometheus Operator offer?

Prometheus is an open-source monitoring system that has become a preferred monitoring tool for distributed systems. Deploying Prometheus and the Alertmanager tool by hand can be complicated, but the Prometheus Operator project (https://github.com/prometheus-operator/prometheus-operator) simplifies and automates the deployment. One of the main features of the Prometheus Operator is to watch the Kubernetes API server for changes to specific resources and objects and to make sure that the current Prometheus deployment matches those objects.
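One common way to get the operator (together with Prometheus, Alertmanager, and Grafana) into a cluster is the community Helm chart; a sketch, assuming the kube-prometheus-stack chart and a “monitoring” namespace:

# Add the community repository and install the full stack.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install infra-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace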

How to monitor a process on a Linux-based AWS EC2 instance with Prometheus and node_exporter?

First of all, we need software on the EC2 machine which gathers the required data and exposes it in a format that Prometheus understands. This is where node_exporter comes in. It allows us to collect hardware, kernel, custom, and other statistics, and expose them on a URL. Installation of node_exporter is straightforward: we download the archive, unpack it, and move the binary into a proper folder. Bear in mind that there is no out-of-the-box way for node_exporter to gather process data on systems using System V init.
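The download-and-install steps might look like this (the release version is illustrative; check the node_exporter releases page for the current one):

# Download and unpack the release archive (v1.8.1 is illustrative).
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar -xzf node_exporter-1.8.1.linux-amd64.tar.gz
# Move the binary into a directory on the PATH.
sudo mv node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/

When starting node_exporter, we have multiple flags to choose from. To monitor all of the systemd services, --collector.systemd is the flag we are interested in. An example of the service execution start for a node_exporter.service can be seen below: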

ExecStart=/usr/local/bin/node_exporter \
    --collector.ntp \
    --collector.systemd \
    --no-collector.fibrechannel \
    --no-collector.infiniband \
    --no-collector.xfs \
    --no-collector.zfs \
    --web.config=/etc/node_exporter/config.yaml \
    --web.listen-address=0.0.0.0:9100 \
    --web.telemetry-path=/metrics
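For context, a minimal systemd unit wrapping such an ExecStart might look like the following (the user, paths, and trimmed flag set are assumptions for the sketch):

[Unit]
Description=Prometheus Node Exporter
After=network-online.target
Wants=network-online.target

[Service]
# Assumed dedicated system user; note that the systemd collector
# needs access to the system D-Bus to query unit states.
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --web.config=/etc/node_exporter/config.yaml \
    --web.listen-address=0.0.0.0:9100
Restart=on-failure

[Install]
WantedBy=multi-user.target

After saving it as /etc/systemd/system/node_exporter.service, running sudo systemctl daemon-reload followed by sudo systemctl enable --now node_exporter starts the exporter.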

So, basically, node_exporter gathers systemd service state, along with the data from the other enabled collectors, and makes it available locally on the EC2 machine on port 9100. The data can be retrieved on the EC2 machine itself at https://localhost:9100/metrics (node_exporter listens on all interfaces, 0.0.0.0:9100). You can “curl” it with the “-k” option to validate that the information is there. Example data exposed by node_exporter for one particular service is:

node_systemd_unit_state{name="auditd.service",state="activating",type="forking"} 0
node_systemd_unit_state{name="auditd.service",state="active",type="forking"} 1
node_systemd_unit_state{name="auditd.service",state="deactivating",type="forking"} 0
node_systemd_unit_state{name="auditd.service",state="failed",type="forking"} 0
node_systemd_unit_state{name="auditd.service",state="inactive",type="forking"} 0

In other words, it simply tells us which state a service is currently in. The above means that the audit daemon (auditd) is in an active state, i.e. running.
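A quick spot check from the instance itself might look like this (“-k” is needed because of the self-signed certificate):

# Fetch the systemd state series for auditd from the local node_exporter.
curl -ks https://localhost:9100/metrics | grep 'name="auditd.service"'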

How to make the data exposed by node_exporter available in Prometheus?

As a first step, we need to define an inbound rule in the EC2 security group that allows connections from the Kubernetes EKS cluster to the EC2 machine on port 9100, on which node_exporter is listening. Use “Custom TCP” as the Type, and as the source for the rule, define the network in which Kubernetes is running. Validate that the communication is working by performing telnet to the EC2’s port 9100 from a container in the namespace where the Prometheus Operator is installed.
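If you prefer the CLI, the same inbound rule can be added with the AWS CLI along these lines (the security group ID and CIDR below are placeholders for your own values):

# Allow the network where the EKS nodes run to reach node_exporter.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 9100 \
  --cidr 10.0.0.0/16

As a second step, we need to define a target in the Prometheus Operator Helm chart, which tells Prometheus to scrape data from that target. It’s better to have a DNS entry for the EC2 machine and use that A record for the target. Considering that this is internal communication between the EC2 machine and the EKS cluster, we can use a self-signed certificate and skip its verification. Example configuration for that: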

- job_name: ec2-monitoring-node-exporter
  scheme: https
  tls_config:
    # Internal traffic with a self-signed certificate, so skip verification.
    insecure_skip_verify: true
  # Resolve the EC2 machine's DNS A record and scrape it on port 9100.
  dns_sd_configs:
  - type: A
    port: 9100
    names:
    - 'ec2-dns-record'
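Where exactly this snippet lives depends on the chart in use; with the community kube-prometheus-stack chart it would typically go under additionalScrapeConfigs (a sketch assuming that chart’s values layout):

# values.yaml (kube-prometheus-stack) -- assumed layout
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: ec2-monitoring-node-exporter
      scheme: https
      tls_config:
        insecure_skip_verify: true
      dns_sd_configs:
      - type: A
        port: 9100
        names:
        - 'ec2-dns-record'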

Once the above configuration is loaded and the Prometheus config-reloader container has picked it up, Prometheus will start scraping data from the EC2’s port 9100, and the data will be available on the defined Prometheus URL. There, under “Targets”, we will be able to see the target we defined above.
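If the Prometheus UI is not exposed externally, a port-forward is enough for a quick look (prometheus-operated is the governing Service the operator creates; the monitoring namespace is an assumption):

# Forward the in-cluster Prometheus UI to localhost:9090,
# then open http://localhost:9090/targets in a browser.
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090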

How to set up an alert that will fire in case a process stops working?

Now that we have made the network configuration needed for communication between EKS and the EC2 machine, started node_exporter, and defined a target in Prometheus, we should also define an alert that fires in case a service stops working. In the Prometheus Operator Helm chart, we can do it like this:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ec2-monitoring-node-exporter
  labels:
    prometheus: infra-prometheus
spec:
  groups:
  - name: systemd.rules
    rules:
    - alert: HostSystemdServiceStopped
      expr: node_systemd_unit_state{instance="ec2-dns-record", job="ec2-monitoring-node-exporter", name="auditd.service", state="active", type="forking"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Host systemd service stopped (instance {{ $labels.instance }})
        description: "systemd service stopped\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Note that the “job” label used in the alert expression must match the “job_name” defined for the scrape target. The above alert will fire in case “auditd.service” is no longer in an “active” state, in other words, has stopped working.
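To verify the rule end to end, you can deliberately stop the service on the EC2 machine and watch the expression in the Prometheus UI (a sketch; note that on some distributions auditd refuses a manual stop, in which case any other monitored service will do for the test):

# On the EC2 machine: stop the monitored service ...
sudo systemctl stop auditd
# ... the series node_systemd_unit_state{name="auditd.service",state="active"}
# now returns 0, and after the 5m "for" duration the alert fires.
sudo systemctl start auditd   # restore the service afterwards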

For more expert tutorials from our seasoned DevOps engineers, please check out our blog.
