How can we protect our Kubernetes clusters from deploying malicious containers or misconfiguring existing ones?
Before we go into the depths of building a reasonably secure Kubernetes environment, we should discuss how the infrastructure of our organization is deployed. There are many ways to do so, and the one I propose here is just one suggestion that might work.
Initializing a Secure Kubernetes Cluster
Initially we need to have Kubernetes clusters up and running, either in a self-hosted solution or in a managed cloud provider – Azure, GCP, AWS, Linode, etc. Self-hosted solutions (like OpenStack) do provide us with more control and options, but then we must take care of securing most of the control plane ourselves. There are plenty of high-quality tutorials and blog posts on that topic, so I will not explore it here.
This base infrastructure (clusters, networking infrastructure, load balancers, etc.) can be spun up using an IaC (Infrastructure as Code) solution like Terraform or Pulumi, and hosted and maintained via pipelines on CI/DevOps platforms like GitHub, GitLab or Azure DevOps.
Both Terraform and Pulumi have extensive documentation on how to provision Kubernetes clusters following the appropriate best practices.
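As an illustration, a minimal GitHub Actions workflow that applies the Terraform code on every push to the main branch could look like the sketch below (the workflow name, triggers and state/backend details are assumptions, not a definitive setup):

```yaml
# .github/workflows/terraform.yml -- hypothetical pipeline that applies
# the Terraform code describing our clusters, networking and load balancers
name: provision-infrastructure
on:
  push:
    branches: [main]
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Init
        run: terraform init              # remote state backend is configured in the *.tf files
      - name: Plan
        run: terraform plan -out=tfplan  # record the exact plan that will be applied
      - name: Apply
        run: terraform apply -auto-approve tfplan
```

In a real setup the apply step would typically be gated behind a manual approval or a protected environment rather than run automatically.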
This cluster should have a Continuous Deployment (CD) solution like ArgoCD or FluxCD deployed and configured to monitor a specific (preferably private) repository in our code platform of choice (GitHub, GitLab, Azure DevOps, etc.), where the YAML files with the configurations for our business applications are stored. Access to this repository must be very tightly regulated and granted only to senior DevOps engineers and selected experienced developers. These YAMLs should contain everything necessary for the business apps to be deployed – Deployments/StatefulSets, networking Services, horizontal autoscalers, resource reservations, integration with secret stores, etc. Even though it is a private repository, all standard best practices still apply – no sensitive information should be stored in it: passwords, API keys, SSH private keys, etc.
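As a sketch of such a setup, an Argo CD Application watching this configuration repository could look like the following (the repository URL, paths and namespaces are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: business-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/deploy-configs.git  # placeholder repo
    targetRevision: main
    path: apps/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # revert manual drift in the cluster back to Git state
```

The `selfHeal` option is what makes Git the single source of truth: any manual change in the cluster is reverted to what the repository declares.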
This repository should be integrated with all the other repositories holding our business applications, so that when one of them is updated and builds a container with a newer tag, an action is triggered to replace the old tag with the new one, and our CD solution can pick up the change.
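One common pattern for this integration, sketched here with hypothetical repository names, file paths and an `IMAGE_TAG` variable, is a step in the application's build pipeline that rewrites the image tag in the configuration repository:

```yaml
# Runs in the application repo's pipeline after the new image is pushed.
# Repo URL, token secret, file path and IMAGE_TAG are assumptions.
- name: Bump image tag in deploy repo
  run: |
    git clone "https://x-access-token:${{ secrets.DEPLOY_REPO_TOKEN }}@github.com/example-org/deploy-configs.git"
    cd deploy-configs
    # Point the Deployment at the freshly built image tag
    yq -i '.spec.template.spec.containers[0].image = "example.azurecr.io/myapp:" + strenv(IMAGE_TAG)' \
      apps/production/myapp-deployment.yaml
    git commit -am "Update myapp image to ${IMAGE_TAG}"
    git push
```

FluxCD can alternatively do this natively with its image automation controllers, which watch the registry and commit tag updates themselves.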
Building Secure Containers for Business Applications
Let us imagine that our organization has a team of skilled developers that are tasked with building the containerized applications for the business that we support.
After they push their code to the respective Git platform, the CI implementation (GitHub Actions, GitLab pipelines (and runners), Azure DevOps) can run routine unit tests, usually defined by the developers themselves. After that, we have the opportunity to set up several interesting tools.
These tools run in the CI pipeline and can monitor the whole container build process.
- Cosign can sign the image (more precisely, its hash) with the developer's private key, so we can be sure that the container comes from that developer before it is pushed to our container registry (Docker Hub, ACR, etc.). This is useful if the registry gets compromised and rogue images are pushed to replace the legitimate ones;
- Trivy can also run in the CI pipeline as part of the build job and scan the container as it is created. It checks the container against a precompiled vulnerability database. It can break the pipeline if sufficiently severe vulnerabilities are found and produce an error log that can be investigated. Trivy can also be used to scan Terraform .tf files for misconfigurations.
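In a GitHub Actions pipeline, the two steps above might be sketched like this (the registry name, image name and secret names are assumptions for illustration):

```yaml
- name: Scan image with Trivy
  run: |
    # Fail the job on HIGH/CRITICAL findings so a vulnerable image never ships
    trivy image --exit-code 1 --severity HIGH,CRITICAL "example.azurecr.io/myapp:${IMAGE_TAG}"

- name: Sign image with Cosign
  env:
    COSIGN_KEY: ${{ secrets.COSIGN_PRIVATE_KEY }}
  run: |
    # Signs the image digest and pushes the signature next to it in the registry
    cosign sign --key env://COSIGN_KEY "example.azurecr.io/myapp:${IMAGE_TAG}"
```

Signing after a successful scan means every signature also implicitly attests that the image passed the vulnerability gate at build time.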
When both tasks are complete, we will have a newly built, scanned and signed container with no known vulnerabilities in it. Please keep in mind that most of the time vulnerabilities come from the underlying base image. It is a good idea to use multi-stage builds so our running container has the fewest tools available in case it gets compromised. Obvious no-nos are curl, git, wget, make, package managers, etc. Good choices are scratch and Alpine Linux. Regular manual rebuilds of the container are also desirable: the base image can get old and expose previously unknown vulnerabilities.
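A minimal multi-stage build might look like this (a Go application is assumed purely for illustration; the build stage keeps the compiler and tooling, the runtime stage ships only the binary):

```dockerfile
# Build stage: has the compiler, git, package managers, etc.
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# Runtime stage: scratch contains no shell, no curl, no package manager
FROM scratch
COPY --from=build /app /app
USER 65534              # run as a non-root user
ENTRYPOINT ["/app"]
```

An attacker who lands in the runtime container has no shell or download tools to work with, which sharply limits what they can do.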
Once the container is built and pushed to our registry with a new tag, the pipelines/actions update that tag in the DevOps repository (the one monitored by our CD solution – ArgoCD or FluxCD), which then pulls the image down into the Kubernetes cluster and deploys it.
There are several actions that we should be taking here:
- To verify the signature created with Cosign, we can use a tool like Connaisseur to check that the pulled image is the one the developer built in their pipeline (this way we can prevent pulling malicious images if our container registry is compromised). It needs the public key corresponding to the private key used to sign the container, and it can block the deployment.
- We can verify whether the security context, resource requests and other policies that orchestrate the image are up to industry standards. Here we can use Trivy, which can run in the cluster as an operator and identify both security vulnerabilities and misconfigurations, creating reports as custom resources (CRs); or Datree, which provides a gorgeous web UI with a vast number of built-in and customizable policies. Unfortunately, Datree's trial version lasts only 14 days. Both also support admission webhooks that can block a deployment if a critical misconfiguration or vulnerability is found.
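The kind of configuration these tools check for can be sketched as a fragment of a Deployment's pod template with a restrictive security context and explicit resource requests (the container and image names are placeholders):

```yaml
# Fragment of a Deployment's pod template spec
spec:
  containers:
    - name: myapp
      image: example.azurecr.io/myapp:1.2.3   # placeholder image
      securityContext:
        runAsNonRoot: true                    # refuse to start as root
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]                       # drop every Linux capability
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 256Mi
```

A pod missing any of these fields is exactly what an in-cluster scanner or admission webhook would flag or reject.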
So far we have built, scanned and deployed containers with a security context, resource limits, etc. enabled. We should also be monitoring the behavior of these containers – the requests they make to the Linux kernel of the underlying Kubernetes nodes. This kind of monitoring needs to be tailored very well to the environment, since many calls can be benign but still get flagged, so an overview of all components in the environment is required.
Two well-known tools suitable for that task are Falco from Sysdig and NeuVector, now owned by SUSE. Both can be deployed in the cluster and provide webhooks to terminate misbehaving containers (for example, if a terminal is opened in a production environment). They can also be integrated with notification solutions like Slack, Teams, Rocket.Chat, etc.
These tools can be configured to monitor vast numbers of system calls from our running pods (containers), and we can exercise strict control over them.
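As an illustration of such a rule, Falco ships with a default rule that flags a shell spawned inside a container; a simplified custom version could look like this:

```yaml
# Simplified Falco rule: alert whenever an interactive shell starts in a container
- rule: Terminal shell in container
  desc: A shell was spawned inside a container - possibly an attacker or an unapproved debugging session
  condition: >
    spawned_process and container and
    proc.name in (bash, sh, zsh) and
    proc.tty != 0
  output: "Shell opened in container (user=%user.name container=%container.name command=%proc.cmdline)"
  priority: WARNING
```

Alerts like this are where tuning matters: a legitimate `kubectl exec` for debugging will trigger exactly the same rule as an attacker, so exceptions have to reflect how the team actually operates.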
Finally, everything listed here is just an automated tool that delivers telemetry to the security operator. It is up to us to design the right rules and policies so our clusters become more secure and resilient.