High availability (Multi-master) Kubernetes cluster hosted on AWS
This is the first post of a mini-series dedicated to running Kubernetes hosted on AWS. It covers the considerations we make when proposing a production-ready, Enterprise-grade Kubernetes environment to our clients. I will get more technical in the next blog post, with the tools and AWS services we are using; here I will try to cover what problems we are solving.
High availability is a characteristic we want our system to have. We aim to ensure an agreed level of operational performance (uptime) for a higher than normal period. These are the principles we follow when doing the system design:
- Elimination of single points of failure. This means adding redundancy to the system so that failure of a component does not mean failure of the entire system.
- Reliable crossover. In redundant systems, the crossover point itself tends to become a single point of failure. Reliable systems must provide for reliable crossover.
- Detection of failures as they occur. If the two principles above are observed, then a user may never see a failure. But the maintenance activity must.
The graph below shows the Kubernetes Master components used for setting up a cluster. I will go through them one by one:
Backing up ETCD availability
ETCD is the API object store used by Kubernetes. If we lose the data here, we are done: we lose our cluster. Here is what we need to keep in mind when running ETCD in a highly available setup:
- Run ETCD as a cluster with an odd number of members. An etcd cluster needs a majority of nodes, a quorum, to agree on updates to the cluster state. For a cluster with n members, quorum is floor(n/2)+1, so a 3-member cluster needs 2 members to agree and tolerates the loss of 1.
- ETCD is a leader-based distributed system. Ensure that the leader periodically sends heartbeats on time to all of the followers to keep the cluster stable.
- Ensure that no resource starvation occurs. The performance and stability of the cluster are sensitive to network and disk IO. Any resource starvation can lead to heartbeat timeout, causing instability of the cluster. An unstable etcd indicates that no leader is elected. Under such circumstances, a cluster cannot make any changes to its current state, which implies no new pods can be scheduled.
- Keeping ETCD stable is critical to the stability of Kubernetes clusters. Therefore, run ETCD on dedicated machines or in isolated environments with guaranteed resources.
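To make the quorum rule above concrete, here is a minimal sketch of the arithmetic. The function names are my own and are not part of etcd; the only thing taken from the text is the majority formula.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


# Print the table for small cluster sizes.
for n in range(1, 8):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

Running it shows why odd sizes are preferred: going from 3 to 4 members raises the quorum from 2 to 3, but the number of failures the cluster survives stays at 1.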
We also need to ensure that our data stays intact in case the running host crashes for whatever reason. Since we are running our cluster in Amazon, the most convenient and reliable service we can use is the Elastic Block Store (EBS). This way, if we stop or terminate an instance, or decide to migrate, we can simply attach the EBS volume to another instance in the same Availability Zone. Each Amazon EBS volume is automatically replicated within its Availability Zone to protect you from component failure, offering high availability and durability.
Building reliable API Server
If ETCD is the brain of our cluster, then the Kubernetes API server is like the rest of the nervous system. The API server acts as the go-between for all data entering and exiting ETCD, both from you and from the worker nodes that your application is deployed on. The great thing about the API server is that the considerations for it are mostly driven by your considerations for ETCD. Since each master node in your cluster is an ETCD member and also runs an API server, you get high availability both for the data your API server consumes and for the functionality of the API server itself.
The API Server in this manual setup communicates with ETCD through localhost, so you do not have to worry about a network problem cutting the API server off from its data. Our work here is to install the API server on each master and set up a load balancer in front of them, so that the load is distributed across all of them. The load balancer is easy to set up using Amazon ELB. We will need the load balancer address when setting up the rest of the master components.
Setting up a Controller manager and Scheduler
So far we haven’t set up anything that actually modifies cluster state; that is the job of the controller manager and the scheduler. To achieve reliability, we only want one actor modifying the state of the cluster at a time, but we also want replicated instances of these actors in case a machine dies. A lease-lock in the API is used to perform master election (the --leader-elect flag).
The scheduler and controller-manager can be configured to communicate using the load balanced IP address of the API servers. Regardless of how they are configured, the scheduler and controller-manager will complete the leader election process mentioned above by using the --leader-elect flag.
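The lease-lock idea can be illustrated with a toy, in-memory model. This is only a sketch of the general technique, not the code Kubernetes runs: the `Lease` class, the `try_acquire_or_renew` function and the 15 second duration are all made up for the example, and a real lock lives in the API server and is updated atomically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None  # identity of the current leader, if any
    renewed_at: float = 0.0       # time of the last successful renewal
    duration: float = 15.0        # seconds; illustrative value only


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Return True if `candidate` holds the lease after this attempt."""
    expired = lease.holder is None or now - lease.renewed_at > lease.duration
    if lease.holder == candidate or expired:
        lease.holder = candidate
        lease.renewed_at = now
        return True
    return False


lease = Lease()
print(try_acquire_or_renew(lease, "master-a", now=0.0))   # master-a takes the free lease
print(try_acquire_or_renew(lease, "master-b", now=5.0))   # lease still valid, master-b stays on standby
print(try_acquire_or_renew(lease, "master-a", now=10.0))  # the leader renews its lease
print(try_acquire_or_renew(lease, "master-b", now=30.0))  # master-a stopped renewing, master-b takes over
```

The point of the sketch is that the standby instances keep trying, so when the leader's machine dies its lease simply expires and one of the replicas becomes the single active actor.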
Kubernetes worker components in detail
If ETCD is the brain and the API server is the rest of the nervous system, the workers are the arms and legs. Your workers are where your applications reside and what everyone else interacts with, and yet they are in some ways the least important part when it comes to the life of your cluster. Workers can die and be replaced with little, if any, consequence to your cluster. If you have multiple workers, your application can retain functionality in most situations. You always want multiple workers, but keep in mind that scaling your workers is extremely quick and easy. It's generally safe to start with a small number and add more in the future if you start running low on disk space, memory or CPU.
kubelet gets the configuration of a pod from the API server and ensures that the described containers are up and running. This is the worker service which is responsible for communicating with the master node. It does not talk to etcd directly: it reads what it needs from the API server and reports node and pod status back through it.
kube-proxy acts as a network proxy and a load balancer for a service on a single worker node. It takes care of the network routing for TCP and UDP packets.
Multi-zone support is deliberately limited: a single Kubernetes cluster can run in multiple zones, but only within the same region (and cloud provider). One limitation of a multi-zone setup is the assumption that the different zones are located close to each other in the network, so no zone-aware routing is performed. In particular, traffic that goes via services might cross zones (even if some pods backing that service exist in the same zone as the client), and this may incur additional latency and cost.
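To get a feeling for how much service traffic crosses zones without zone-aware routing, here is a small back-of-the-envelope sketch. It assumes every backend pod is equally likely to be picked for a request, which is my simplification for the example; the function name and the zone list are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from typing import Sequence


def cross_zone_fraction(client_zone: str, backend_zones: Sequence[str]) -> Fraction:
    """Fraction of requests that leave `client_zone`, assuming a uniform pick among backends."""
    if not backend_zones:
        raise ValueError("service has no backends")
    per_zone = Counter(backend_zones)
    local = per_zone[client_zone]  # Counter returns 0 for zones without backends
    return Fraction(len(backend_zones) - local, len(backend_zones))


# One backend pod per zone, client running in eu-west-1a.
backends = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
print(cross_zone_fraction("eu-west-1a", backends))
```

With one backend in each of three zones, two out of three requests cross a zone boundary under this assumption, even though a local backend exists.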
Talking about scalability
I have barely talked about how you are going to scale out your applications. We want this to be as flexible as possible. We have two kinds of instances that need scaling.
Scaling Kubernetes Master - No matter if you are a small startup or a big Enterprise, when going for High Availability in Amazon you want your Masters spread across the entire Region. In the given example, I am using eu-west-1 (Ireland). In each Availability zone (eu-west-1a, eu-west-1b, eu-west-1c) we will set up 1 Kubernetes Master, which will be assigned to its own Auto-scaling group.
Why is each master in its own Auto-scaling group? - The main reason is to keep us safe from replicating an unhealthy instance.
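The layout of one single-instance group per zone can be written down as plain data. This is a sketch only: the dictionary keys and the `k8s-master` name prefix are hypothetical and are not the parameter names of any AWS API. Pinning every size to 1 is my reading of "1 Kubernetes Master" per zone.

```python
def master_group_specs(zones, name_prefix="k8s-master"):
    """Describe one fixed-size group (min = max = desired = 1) per Availability zone."""
    return [
        {
            "name": f"{name_prefix}-{zone}",
            "availability_zones": [zone],  # each group is pinned to a single zone
            "min_size": 1,
            "max_size": 1,
            "desired_capacity": 1,
        }
        for zone in zones
    ]


for spec in master_group_specs(["eu-west-1a", "eu-west-1b", "eu-west-1c"]):
    print(spec)
```

With this shape a group never grows beyond one master; all it can do is replace the master of its own zone when that instance goes away.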
Scaling Kubernetes Worker - As described above, the Kubernetes worker node is our workhorse. It is enough to set it up and configure it once. After that it is easy to assign it to an Auto-scaling group. Raising the "Desired" number of Worker instances will then bring up identical hosts, with exactly the same Worker components. Their kubelets will all be configured the same way and they will be able to register with the Kubernetes master.
When the Worker instances run out of resources, the Auto-scaling group can bring up a fresh one in another Availability zone. This mechanism is one of the strongest in terms of scaling in Amazon.
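As a rough illustration of how new workers end up spread over the zones, here is a toy placement rule that puts the next worker into the zone with the fewest running workers. It is a simplified model of zone balancing written for this post, not the actual Auto Scaling algorithm, and the function name is hypothetical.

```python
from collections import Counter


def pick_zone_for_new_worker(zones, running):
    """Pick the zone with the fewest running workers (ties go to the first zone listed)."""
    counts = Counter(running)  # zones without workers count as 0
    return min(zones, key=lambda zone: counts[zone])


zones = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
running = ["eu-west-1a", "eu-west-1a", "eu-west-1b"]

# Scale out by two workers and watch where they land.
for _ in range(2):
    zone = pick_zone_for_new_worker(zones, running)
    running.append(zone)
    print("new worker in", zone)
```

In this example the first new worker goes to the empty zone eu-west-1c, and the second to eu-west-1b, so no single zone ends up carrying all the workers.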
All this talk has led us to the architecture below.
This is how easy it is to set up a Kubernetes cluster by hand in highly available mode, with persistent storage, in Amazon :)
In the next blog post I will get more technical. I will set up a Kubernetes cluster in Amazon from scratch, with a new Amazon IAM account. Then I will show how easy (for real) it is using kops and its magic.