Pivotal offers business transformation, a cloud-native platform, microservices, containers, developer tools, and consulting services that help enterprise-level businesses build and run their applications. VMware recently announced its intention to acquire Pivotal for $2.7 bn.
Handling issues in a production setup is never easy, especially in a Kubernetes cluster environment. Pivotal recently faced an application outage on Pivotal Container Service (PKS). The team recovered quickly, but in this blog we’ll look at what happened to the cluster, what the impact was, and how they recovered.
The application was a scale-out Java application running on Pivotal Container Service (PKS) on-prem. It received requests from outside Kubernetes, made the appropriate call to a database that also lived outside Kubernetes, and returned the results. The app was accessed via an Ingress with a standard Service endpoint and performed a ton of processing upon startup: it needed 20-24 minutes to warm up its cache before it could receive requests. Pivotal had run the application without problems for several weeks, scaled out to about 30 pods.
They needed to upgrade the eight-node cluster from PKS 1.2.6 to 1.3. Upgrading PKS is a mostly automated process that normally proceeds like this:
i. The upgrade is initiated;
ii. The rest is performed by BOSH, node by node:
· The node is cordoned and drained;
· The node is deleted from the IaaS;
· A new node is created from the new image on the IaaS;
· The new node is added to the cluster.
Each node in PKS normally takes 3-4 minutes to complete the upgrade cycle and all is handled through an automated process.
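The sequence above can be sketched as a strictly serial plan. This is only an illustration of the ordering described in this post, not BOSH's actual implementation; the node names and the `upgrade_plan` helper are hypothetical.

```python
from typing import Iterator, List, Tuple

# The four per-node steps, in the order listed in this post.
STEPS = (
    "cordon and drain",
    "delete from IaaS",
    "create from new image",
    "join cluster",
)


def upgrade_plan(nodes: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (node, step) pairs; a node finishes all steps before the next starts."""
    for node in nodes:
        for step in STEPS:
            yield node, step


# Hypothetical node names for an eight-node cluster.
plan = list(upgrade_plan([f"node-{i}" for i in range(1, 9)]))
for node, step in plan[:5]:
    print(node, "->", step)
```

As described, nothing in this sequence waits for the pods evicted from one node to become ready again before the next node is drained.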
About 25 minutes into the upgrade, the monitoring system raised an alert about failing transactions: more than 95% of the requests to the application were timing out, and the application became totally unresponsive. Normally, when a node goes away, Kubernetes migrates the affected pods to other nodes of the cluster, and the whole migration is invisible to end users. So how did this happen?
The Root Cause
The critical part of this outage is the startup time mentioned above, roughly 25 minutes. The deployment ran three pods on each of the 8 nodes. When the first node of the cluster was drained, its 3 pods were evicted and restarted on other nodes of the cluster, but every restarted pod had to go through the full 25-minute startup again to warm up its cache.
After the 3-4 minutes it took to upgrade the first node, the same thing happened to the second node. Its 3 pods were evicted and began their own 25-minute startup. Worse, any pods transferred from the first node that had landed on the second node, and had not yet finished starting, were killed again.
The upgrade of the entire cluster took 20 minutes. All pods were running as the deployment specified, but none of them was ready to accept traffic or had been added back to the Ingress, and this caused a complete failure of the application.
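The cascade can be reproduced with a small simulation. This is a rough sketch under stated assumptions, not a model of the real scheduler: it assumes 8 nodes with 3 pods each, a new drain every 3 minutes (the low end of the 3-4 minute cycle), a 25-minute warm-up, and a simplified placement rule that sends each evicted pod to the least-loaded node that is not being drained. The figures in this post are approximate, so the timings will not match the incident exactly.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    node: int          # index of the node the pod currently runs on
    ready_at: int = 0  # minute at which the pod can serve traffic


def simulate(nodes=8, pods_per_node=3, drain_interval=3, warmup=25):
    """Return [(minute, ready_pods), ...] for a node-by-node rolling upgrade."""
    pods = [Pod(node=n) for n in range(nodes) for _ in range(pods_per_node)]
    upgrade_end = nodes * drain_interval
    timeline = []
    for minute in range(upgrade_end + warmup + 1):
        step, offset = divmod(minute, drain_interval)
        if offset == 0 and step < nodes:
            # Node number `step` is drained at this minute.
            evicted = [pod for pod in pods if pod.node == step]
            for pod in evicted:
                # Count pods per schedulable node (the draining node is excluded).
                load = {n: 0 for n in range(nodes) if n != step}
                for other in pods:
                    if other.node != step:
                        load[other.node] += 1
                # Simplified placement: least-loaded node, lowest index on ties.
                pod.node = min(load, key=lambda n: (load[n], n))
                pod.ready_at = minute + warmup  # cache warm-up starts over
        timeline.append((minute, sum(p.ready_at <= minute for p in pods)))
    return timeline


timeline = simulate()
ready_when_upgrade_ends = dict(timeline)[8 * 3]
print("ready pods when the last node finishes:", ready_when_upgrade_ends)
```

Under these assumptions no pod is ready when the last node finishes: every pod has been restarted at least once during the upgrade, and none of them has had a full 25 minutes to warm up.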
Kubernetes offers a feature called Pod Disruption Budget (PDB) that stops a node from fully draining when doing so would cause an overall application failure. For this particular app, a 70 percent loss of the pods could be suffered without any negative impact. With a PDB, a threshold value is declared and applied to any given set of pods through selectors.
In this particular case, the PDB was created with the selector app: myappname and the threshold maxUnavailable: 30%.
With this in place, when a drain starts, the Pod Disruption Budget (PDB) checks the current state of all pods that match the selector app: myappname. The upgrade workflow then waits to drain the node until enough pods are ready for the evictions to stay within maxUnavailable: 30%.
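A minimal sketch of such a PDB, built as a Python dict and printed as JSON (which `kubectl apply -f` accepts alongside YAML). Only the selector and the threshold come from this post; the object name `myappname-pdb` is hypothetical, and `policy/v1` is the current API version, whereas older clusters used `policy/v1beta1`. The helper `allowed_disruptions` is also hypothetical: it approximates the budget arithmetic, assuming percentages are rounded up as the Kubernetes documentation describes.

```python
import json

# Hypothetical object name; selector and threshold are taken from the text.
pdb_manifest = {
    "apiVersion": "policy/v1",  # assumption: older clusters used policy/v1beta1
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "myappname-pdb"},
    "spec": {
        "maxUnavailable": "30%",
        "selector": {"matchLabels": {"app": "myappname"}},
    },
}


def allowed_disruptions(expected_pods, healthy_pods, max_unavailable_percent):
    """Approximate how many more pods may be evicted right now."""
    # Ceiling division: the percentage is rounded up to a whole pod.
    budget = -(-expected_pods * max_unavailable_percent // 100)
    desired_healthy = expected_pods - budget
    return max(0, healthy_pods - desired_healthy)


if __name__ == "__main__":
    print(json.dumps(pdb_manifest, indent=2))
    # 24 pods = 3 pods on each of the 8 nodes described in this post.
    for healthy in (24, 21, 18, 16):
        print(healthy, "healthy ->", allowed_disruptions(24, healthy, 30), "evictions allowed")
```

With 24 pods, this arithmetic allows at most 8 pods to be unavailable at once. The first two nodes (6 pods) could drain immediately, but the drain of the third node would stall after two evictions until earlier pods finished warming up.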
Fortunately, the outage lasted only about 14 minutes. By the time the last node was upgraded, the pods evicted from the first node had nearly completed their startup. It was an inexpensive lesson for the Pivotal engineers, who decided to do a few things differently: run large-scale destructive testing rather than only the basic tests they had tried, document best practices, and figure out how to reduce the startup time of the app.
Another way to prevent such an issue is to use the Kalc Kubernetes config validator. With Kalc, you can minimize your Kubernetes issues by running autonomous checks and config validations and by predicting other risks before they affect the production environment. The AI-first Kubernetes Guard navigates through a model of possible actions and events that could lead to an outage, modeling the outcomes of changes and applying a growing knowledge of config scenarios.