Intro

The rapid adoption of Kubernetes has led to an increase in outages that affect entire company operations. Recently, SourceClear experienced a Kubernetes outage that lasted two days and affected multiple teams. SourceClear is a software security company that uses data science and machine learning to help developers use open source safely by analyzing the libraries they use as part of their CI/CD pipeline. Their clients include big brands like Atlassian, LinkedIn and Uber.

This post will explore why the outage happened, how the team fixed it, and what you can do to keep the same thing from happening in your own cluster.


The Setup

To start with, this outage affected a QA/DEV Kubernetes cluster. The team had been gearing up for an AWS region migration and, to prepare for it, had been provisioning and tearing down multiple Kubernetes clusters using kubespray, to make sure they had a clear idea of what to do if the migration failed for some reason.

During one of these cycles, a cluster operator unwittingly deleted some IAM roles and EC2 instance profiles with Terraform. This was a problem because Kubernetes needs them for the cloud-controller-manager to talk to AWS. The operator quickly noticed the mistake and restored the roles with the proper permissions.
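
If you ever need to sanity-check that the roles and instance profiles your nodes depend on are back in place, the AWS CLI can confirm it. A minimal sketch, assuming placeholder profile and instance names:

    # Check that the node instance profile exists and which role it carries
    # (profile name is a placeholder for your cluster's node profile)
    aws iam get-instance-profile --instance-profile-name k8s-node-profile

    # Confirm a given worker node still has that profile attached
    # (instance ID is a placeholder)
    aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
      --query 'Reservations[].Instances[].IamInstanceProfile'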

The Challenge

With Amazon EKS clusters, you can associate an IAM role with a Kubernetes service account. Amazon EKS requires applications to sign their AWS API requests with AWS credentials, and the service account provides those credentials to the containers in any pod that uses it. The applications in the pod's containers can then use an AWS SDK or the AWS CLI to make API requests to authorized AWS services. This feature removes the need to grant broad permissions to the worker node IAM role just so that pods on the node can call AWS APIs; instead, permissions are scoped to the service account each pod actually uses.
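
As a rough illustration of what that association looks like, a service account is annotated with the IAM role ARN and any pod that uses the service account picks up those credentials. The role ARN and names below are placeholders:

    # Sketch: associate an IAM role with a service account (names and ARN are placeholders)
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: s3-reader
      namespace: default
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/my-pod-role
    EOF

    # Pods that set serviceAccountName: s3-reader now sign AWS API calls with that role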

The Event

The first signs of trouble were errors in the journalctl logs from the kubelet. The kubelet was logging its inability to authenticate to the cloud provider (AWS). This cluster had been provisioned using kubespray, so the team decided to rerun kubespray in the hope that it would detect and roll back any bad configs to what was defined in their playbook.
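
If you are chasing similar symptoms, the kubelet's unit logs are the place to look. A quick sketch, assuming a systemd-managed kubelet:

    # Follow the kubelet logs and watch for cloud-provider / credential errors
    journalctl -u kubelet -f

    # Or narrow it down to recent entries mentioning the cloud provider
    journalctl -u kubelet --since "1 hour ago" | grep -i -E 'cloud|credential|auth'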

The Root Cause

kubespray doesn't lay anything down for IAM roles, and rerunning it without the environment-specific variables forced it to fall back to its default variables. So rerunning kubespray made things worse: everything in the kube-system namespace started falling over. On top of that, because of a misplaced variable in the vars file, the cluster was now running two copies of the api-server, controller-manager and kube-scheduler, plus two CNI plugins.
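
A quick way to spot that kind of duplication is to list what is actually running in kube-system; the exact component names depend on how the cluster was provisioned:

    # Look for duplicated control plane components
    kubectl -n kube-system get pods -o wide | grep -E 'apiserver|controller-manager|scheduler'

    # CNI plugins usually run as DaemonSets; two entries here means two plugins fighting
    kubectl -n kube-system get daemonsets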

The CNI plugin is tasked with wiring the host network and adding the correct interface to the pod namespace. The plugin allocates VPC IP addresses to Kubernetes nodes and configures the necessary networking for pods on each node. It allows Kubernetes pods to have the same IP address inside the pod as they do on the VPC network. 
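
One way to verify that behaviour is to compare pod IPs with the VPC subnet CIDRs; the subnet tag filter below is an assumption about how your subnets are named:

    # Pod IPs as Kubernetes sees them
    kubectl get pods --all-namespaces -o wide

    # CIDRs of the VPC subnets the nodes live in (tag filter is a placeholder)
    aws ec2 describe-subnets --filters "Name=tag:Name,Values=k8s-*" \
      --query 'Subnets[].CidrBlock'

    # With a VPC-native CNI, every pod IP should fall inside one of those CIDRs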

The Fix

After understanding the root cause, they removed the duplicate control plane pods and scaled the second CNI plugin down to zero. The cluster came back up, and they could make API calls again. Pods began scheduling, although every new pod went into a persistent CrashLoopBackOff.
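
Since CNI plugins typically run as DaemonSets, "scaling to zero" usually means giving the DaemonSet a node selector that no node satisfies. A sketch of that trick, with a placeholder DaemonSet name:

    # Take a secondary CNI DaemonSet out of service without deleting it:
    # give it a nodeSelector that matches no node (DaemonSet name is a placeholder)
    kubectl -n kube-system patch daemonset secondary-cni \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"disabled":"true"}}}}}'

    # Verify it now has 0 desired pods
    kubectl -n kube-system get daemonset secondary-cni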

Apparently, another one of those kubespray default values was a pod CIDR. Their desired CNI plugin was sitting in the right VPC subnet, but Kubernetes pods were still getting IPs from the wrong CIDR. They had already removed the secondary CNI plugin, so where was that value coming from?
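
To see which CIDR a node (and therefore its pods) has been handed, you can inspect the node spec and the controller-manager flags; the controller-manager pod name below is a placeholder for whichever node hosts it:

    # Pod CIDR assigned to each node
    kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_CIDR:.spec.podCIDR

    # The cluster-wide CIDR the controller-manager was started with
    # (static pod name is a placeholder; it is suffixed with the node name)
    kubectl -n kube-system get pod kube-controller-manager-master-1 -o yaml | grep cluster-cidr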

They eventually found out that the secondary CNI pods had left a config file behind in /etc/cni/net.d/. This is the directory the CNI configuration is read from (populated by a ConfigMap), and when more than one config file is present the one that sorts first by name is picked up, so the leftover secondary plugin config was coexisting with the primary config file and likely being chosen instead of it.
They began working through the list of nodes: ssh into the node, rm the offending config file (10-<secondary-plugin>.conf), and sudo reboot so the node came back with the proper CIDRs. Case solved! The service came back up to the desired state.
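
For reference, that per-node cleanup lends itself to a small loop. A rough sketch, assuming SSH access to each node and keeping the placeholder file name from above:

    # Per-node cleanup (node names and the config file name are placeholders)
    for node in node-1 node-2 node-3; do
      ssh "$node" 'sudo ls /etc/cni/net.d/'                            # confirm the stale config is present
      ssh "$node" 'sudo rm /etc/cni/net.d/10-<secondary-plugin>.conf'  # remove the leftover secondary CNI config
      ssh "$node" 'sudo reboot'                                        # reboot so pods come back with the right CIDR
    done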

Conclusion

One of the challenges of working with Kubernetes is validating your manifests prior to deployment. A wrong config in the env values can take down a cluster. This outage started with a mistake. A small mistake, but one that took the cluster down for two days. It wasn't production downtime, but it still hurt productivity for the distributed team of developers. What followed was, in effect, a spirited exercise in chaos engineering. Kalc offers a less disruptive alternative to chaos engineering.

At Kalc, we have approached this problem by reproducing Kubernetes' behavior in AI. Our Kubernetes AI is continuously trained on the most common failure scenarios, and it lets developers test their env values against those scenarios before deploying. This leads to fewer outages, a more stable CI/CD pipeline and better visibility. This is a must-have for every Kubernetes installation.