Jetstack is a fast-growing Kubernetes professional services company that helps startups, SMBs, and enterprises modernize their cloud-native infrastructure. They have been building, operating, and contributing to the Kubernetes ecosystem since 2015.
Jetstack provides multi-tenant Kubernetes platforms to their customers, who sometimes have requirements that cannot be met with stock Kubernetes configuration. To implement such requirements, they recently started using the Open Policy Agent (OPA) project as an admission controller to enforce custom policies. In this post, we'll discuss an incident caused by a misconfigured admission webhook in one of these Kubernetes environments, and how Jetstack's engineers recovered from it.
Jetstack wanted to upgrade the master for a development cluster used by a number of teams to test their applications during the workday. The cluster was running in the europe-west1 region of Google Kubernetes Engine (GKE). The teams had been warned in advance about the upgrade, and Jetstack had already upgraded another pre-production Kubernetes environment earlier that day. They started the master upgrade through their GKE Terraform pipeline.
When the master upgrade started, the Terraform run hit its configured 20-minute timeout before the upgrade completed. The GKE console, however, still showed the cluster as upgrading. This was the first sign that something was wrong.
They ran the pipeline again, which failed with the following error:

    google_container_cluster.cluster: Error waiting for updating GKE master version:
    All cluster resources were brought up, but the cluster API is reporting that:
    component "kube-apiserver" from endpoint "gke-..." is unhealthy
The API server timed out repeatedly, and the development teams could not deploy their applications during this time. While the incident was being investigated, the nodes were being destroyed and recreated in an endless loop, leading to indiscriminate service loss for all tenants.
The Root Cause
To rectify the issue, they contacted Google Support, who identified the following sequence of events behind the outage:
- GKE completed the upgrade of one master instance, which began receiving all API server traffic while the remaining masters were upgraded.
- During the upgrade of the second master instance, the API server was unable to run the PostStartHook for ca-registration.
- While running this hook, the API server attempted to update a ConfigMap called extension-apiserver-authentication in kube-system. The operation timed out because the validating Open Policy Agent (OPA) webhook they had configured was not responding.
- This operation must complete for a master to pass its health check; because it kept failing, the second master entered a crash loop and the upgrade halted.
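To see why the API server's own ConfigMap write was intercepted at all, it helps to look at how broadly such a webhook can be registered. The following is a hypothetical reconstruction (names and the service reference are illustrative, not Jetstack's actual manifest) of an unscoped registration in the style of the OPA admission-control tutorial:

```yaml
# Hypothetical, overly-broad webhook registration.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: opa-validating-webhook
webhooks:
  - name: validating-webhook.openpolicyagent.org
    rules:
      # Matches every operation on every resource in every namespace --
      # including the kube-system ConfigMap update performed by the
      # API server's PostStartHook.
      - operations: ["*"]
        apiGroups: ["*"]
        apiVersions: ["*"]
        resources: ["*"]
    clientConfig:
      service:
        namespace: opa
        name: opa
```

With rules this wide, an unresponsive OPA backend stalls writes the control plane itself depends on.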
The API was therefore intermittently unavailable. Because of this, kubelets could not report node health, which triggered GKE node auto-repair to recreate the nodes over and over.
Once they identified that the webhook was causing the intermittent API server outages, they deleted the ValidatingWebhookConfiguration resource to restore cluster service.
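The emergency fix amounts to a single kubectl command against a live cluster; the webhook name below is an assumption, so list the configurations first:

```shell
# Find the registered validating webhooks, then remove the failing one
# so API server writes no longer wait on OPA. The name is illustrative.
kubectl get validatingwebhookconfigurations
kubectl delete validatingwebhookconfiguration opa-validating-webhook
```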
Since then, they have configured their OPA ValidatingWebhookConfiguration to watch only the namespaces where the policy applies and to which the development teams have access. They also enabled the webhook only for Ingress and Service resources, the only resource types their policy validates.
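A sketch of such a tightened registration, assuming a namespaceSelector label convention and service reference of my own choosing (the source does not show the actual manifest):

```yaml
# Scoped webhook registration: opt-in namespaces, specific resources only.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: opa-validating-webhook
webhooks:
  - name: validating-webhook.openpolicyagent.org
    # Only namespaces carrying this (illustrative) label are validated,
    # so kube-system and other control-plane namespaces are never touched.
    namespaceSelector:
      matchLabels:
        opa-webhook: enabled
    rules:
      # Only the resource types the policy actually validates.
      - operations: ["CREATE", "UPDATE"]
        apiGroups: ["", "extensions", "networking.k8s.io"]
        apiVersions: ["*"]
        resources: ["services", "ingresses"]
    clientConfig:
      service:
        namespace: opa
        name: opa
```

Scoping both the namespaces and the resources keeps a misbehaving webhook from blocking unrelated API traffic.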
Jetstack's engineers redeployed OPA with these fixes and updated the documentation accordingly. A liveness probe was also added to ensure that OPA is restarted whenever it becomes unresponsive.
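OPA exposes a /health endpoint that a liveness probe can poll. The port and scheme below depend on how OPA is deployed, so treat this as a sketch rather than Jetstack's exact configuration:

```yaml
# Liveness probe for the OPA container (add under the container spec).
# Port 443/HTTPS assumes OPA is serving the admission webhook over TLS.
livenessProbe:
  httpGet:
    path: /health
    scheme: HTTPS
    port: 443
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```

With this in place, the kubelet kills and restarts an unresponsive OPA container instead of leaving it to silently time out admission requests.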
If they had had alerting on API server response times, they might have noticed the increased latency of CREATE and UPDATE requests after the OPA-backed webhook was first deployed.
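Assuming a Prometheus setup that scrapes the API server (which the source does not describe), an alert on write-request latency could look like the following sketch. Note that the API server's metrics label verbs as POST/PUT/PATCH, which correspond to CREATE and UPDATE operations; the metric name varies by Kubernetes version:

```yaml
# Sketch of a Prometheus alerting rule; thresholds are illustrative.
groups:
  - name: apiserver-latency
    rules:
      - alert: APIServerWriteLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le, verb) (
              rate(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH"}[5m])
            )
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "99th percentile write latency on the API server is above 1s"
```

An alert like this would have fired soon after the webhook started adding seconds to every write, well before a master upgrade turned the latency into an outage.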
It also hammers home the value of configuring probes for all workloads. In hindsight, deploying OPA seemed deceptively simple, and they did not reach for the Helm chart that they perhaps should have. That chart encodes a number of changes beyond the basic configuration in the tutorial, including a livenessProbe for the admission controller containers.
Another way to prevent such issues is to use the Kalc Kubernetes config validator. With Kalc, you can reduce Kubernetes incidents by running autonomous checks and config validations that predict risks before they affect the production environment. This AI-first Kubernetes guard searches through a model of possible sequences of events that could lead to an outage, modeling the outcomes of changes and applying a growing knowledge base of config scenarios.