Universe.com, a division of Ticketmaster, is shaping the future of the event industry with Kubernetes. The company provides meaningful, real-life experiences to people around the globe through a world-class event ticketing platform: attendees use it to find events in their area, and organizers use it to publicize their events. With more than 28,000 event organizers across 400 cities worldwide and an IT platform budget of roughly $250,000, Universe is far from a small operation, and it is coping with this growth by migrating more and more services to Kubernetes.
Recently, they fixed a Kubernetes Jobs-related issue that was affecting search in Discover, a recommendation engine for finding nearby events based on your interests and on what your friends are doing. The search tool relies on data fields that are indexed for full-text search. Universe recently migrated this tool to Kubernetes, implementing it as a Kubernetes Job so it runs as a background task.
Shortly after the migration, they noticed problems with the Kubernetes cluster. These affected Universe's monitoring and search APIs, along with the Kubernetes cluster UI.
They suspected that the new Kubernetes Job had introduced the breaking change, and some quick troubleshooting confirmed it: the Job's Pods would always restart on failure and were running more than once on the same Node.
This Job would consume all available resources on a Node, causing the Node to become unhealthy and get killed. Kubernetes would then reschedule the greedy Job on another Node, and the same sequence would replay.
They had a single Job object that was meant to run one Pod reliably to completion. A Job object normally starts a new Pod if the first Pod fails or is deleted (in this case, because a Node hit its resource limit).
The engineers were forcing a non-zero exit code and expected the Job not to be recreated. Yet even with restartPolicy: Never set in the Pod template, the Job controller kept creating more and more Pods on failure.
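A minimal Job manifest of the kind described above might look like the following sketch. The name, image, and command are hypothetical placeholders, not Universe's actual configuration; the relevant part is that restartPolicy sits on the Pod template, not on the Job itself:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: discover-indexer        # hypothetical name for illustration
spec:
  template:
    spec:
      containers:
        - name: indexer
          image: example/indexer:latest   # placeholder image
          command: ["run-indexing"]       # placeholder command
      # Applies to the Pod: a failed container is not restarted in place.
      # The Job controller, however, still creates a replacement Pod.
      restartPolicy: Never
```

Because restartPolicy belongs to the Pod spec, Never only tells the kubelet not to restart the container in place; it does not tell the Job controller to stop creating new Pods.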
The Root Cause
A container running inside a Pod may fail for several reasons. In this case, it was because the process in it exited with a non-zero exit code. If this happens, and if .spec.template.spec.restartPolicy = "OnFailure", then the Pod remains on the Node, but the container is re-run.
When a Job fails to terminate, the Job controller creates a never-ending stream of Pods that end up in the Error state, because the controller is written to keep trying until it reaches completion.
The biggest problem with "try to reach completion" is that it leaves many dead/errored Pod objects in etcd, which significantly slows down all API calls.
Setting restartPolicy: OnFailure prevents the never-ending creation of Pods, because the kubelet simply restarts the failing container inside the existing Pod.
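In manifest form, the change is a one-line difference in the Pod template (the surrounding fields are hypothetical placeholders):

```yaml
spec:
  template:
    spec:
      containers:
        - name: indexer
          image: example/indexer:latest   # placeholder image
      # The kubelet re-runs the failed container in the same Pod,
      # so no new Pod objects accumulate in etcd.
      restartPolicy: OnFailure
```

The trade-off is that a container that can never succeed will be restarted (with back-off) indefinitely on the same Node unless something else bounds it.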
However, if you want new Pods to be created on failure with restartPolicy: Never, you can limit them by setting activeDeadlineSeconds. Once the deadline passes without success, the Job gets the status reason DeadlineExceeded, no additional Pods are created, and existing Pods are deleted.
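Putting the two together, a bounded version of such a Job could be sketched as follows (names and the deadline value are illustrative assumptions, not values from the incident):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: discover-indexer        # hypothetical name for illustration
spec:
  # After 600 seconds without success, the Job is terminated with
  # status reason DeadlineExceeded and stops creating Pods.
  activeDeadlineSeconds: 600
  template:
    spec:
      containers:
        - name: indexer
          image: example/indexer:latest   # placeholder image
      restartPolicy: Never
```

Note that activeDeadlineSeconds is set on the Job spec, not on the Pod template, and it caps the Job's total runtime regardless of how many Pods have been created.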
You can measure the impact of the changes you make in your Kubernetes cluster with Kalc. Our AI-first solution will help you minimize the risk of pushing this and other breaking changes to your Kubernetes cluster. We’ve trained our Kubernetes AI with the most common failure scenarios and we are letting developers test their configs against these scenarios.
We have seen that the restart policy applies to a Pod, not to a Job. By design, Kubernetes' main role is to run a Pod to successful completion. The control plane cannot judge from the exit code whether a Job failure was expected (and the Job should be re-run) or not (and retries should stop). It is the user's responsibility to make that distinction based on the output from Kubernetes.
We have also learnt that the OnFailure restart policy means the container will only be restarted if something goes wrong, while the Never policy means the container won't be restarted regardless of why it exited.
Finally, we’ve seen that you can protect your Kubernetes cluster from human error by leveraging Kalc’s AI-first solution, which replicates your current cluster environment in AI. Our new tool, kubectl-val, lets your developers run autonomous checks and config validations, helping you minimize outages in your Kubernetes cluster.