Cloud-native applications are often designed as a collection of distributed microservices that run in containers. Today Kubernetes has become the go-to solution for deploying and orchestrating containerized applications. It has a rich set of APIs that abstract away the underlying hardware infrastructure, acting as a distributed operating system for your cluster.
Airbnb uses a service discovery system called SmartStack, an automated service discovery and registration framework that was open-sourced in 2013. Its components include ZooKeeper, HAProxy, Synapse, and Nerve.
ZooKeeper is a key-value store that tracks the cluster state. Nerve handles service health checks and registers healthy instances in ZooKeeper. Synapse queries ZooKeeper for service providers and writes them into the HAProxy configuration. At its core, their setup runs an HAProxy sidecar container in each Pod that acts as the outbound proxy for service-to-service traffic.
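A minimal sketch of that sidecar pattern might look like the following (the names, image tags, and port are illustrative, not Airbnb's actual manifests):

```yaml
# Hypothetical Pod spec: the app sends outbound requests to the HAProxy
# sidecar over localhost, which routes them to healthy service providers
# that Synapse discovered via ZooKeeper.
apiVersion: v1
kind: Pod
metadata:
  name: my-service
spec:
  containers:
  - name: main                # the application container
    image: my-service:1.0
  - name: haproxy             # sidecar: outbound proxy for service calls
    image: haproxy:2.4
    ports:
    - containerPort: 9001     # the app talks to dependencies via localhost:9001
```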
As companies continue to adopt container technology, they increasingly run into challenges operating microservices efficiently; without extra tooling they are flying blind, with little visibility into their containers. Luckily, there are open source service mesh frameworks like Istio, Linkerd, and SmartStack that provide visibility into, and easy management of, service-to-service communication within Kubernetes clusters.
Airbnb had experienced outages related to the service mesh. Their simple fix was to slow down service discovery updates so that HAProxy would be restarted no more often than once every 30 seconds.
Since most developers don’t like waiting and want to save as much time as possible, Kubernetes has a great property: it can deploy really fast. By cranking up maxSurge in their Deployment strategy, they were able to roll out the entire service in less than 30 seconds.
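A Deployment strategy along these lines trades extra capacity during the rollout for speed. This is a rough sketch; the replica count, surge values, and names are illustrative, not Airbnb's actual settings:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 100%        # start all 10 replacement Pods at once
      maxUnavailable: 0     # never drop below 10 ready Pods mid-rollout
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: main
        image: my-service:1.0
```

With maxSurge at 100%, every new Pod starts immediately and old Pods are torn down as soon as their replacements are ready, which is exactly what outpaces a discovery system that only refreshes every 30 seconds.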
Now the service mesh couldn’t keep up, so each time the service deployed they’d see a spike in errors as traffic kept flowing to Pods that were already gone.
The Root Cause
The SmartStack system periodically updates the HAProxy config and restarts HAProxy to keep things up to date. Unfortunately, restarting HAProxy is memory intensive: HAProxy forks and starts a new copy of itself, taking up roughly as much memory as the first copy. The original copy finishes serving any outstanding connections and then terminates.
So your memory usage can at least double while HAProxy is restarting. If your connections are long-lived and your service mesh is changing frequently, the old copies never get a chance to exit, you end up with many HAProxies running at once, and you run into an OutOfMemory error.
The fix is to tell Kubernetes to slow down the termination of Pods on deploy. You add a preStop hook that sleeps long enough for your service mesh to catch up and figure out what happened.
This puts the Pod into the Terminating state; by default, Kubernetes force-kills it after a 30-second grace period.
```yaml
containers:
- name: main
  lifecycle:
    preStop:
      exec:
        # Give yourself 120 seconds to wait for SmartStack to realize that
        # this pod is gone and stop sending traffic to it, before termination
        command: ["/bin/sleep", "120"]
```
So you also need to set terminationGracePeriodSeconds to something longer than the preStop sleep. Once you have those two settings, the Pods survive long enough for your service mesh to catch up with what happened.
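Putting the two settings together might look like this sketch; the 150-second value is an illustrative choice, it just needs to exceed the 120-second preStop sleep:

```yaml
spec:
  # Must exceed the preStop sleep, or the kubelet will SIGKILL the
  # container before the hook finishes (the default is 30 seconds)
  terminationGracePeriodSeconds: 150
  containers:
  - name: main
    image: my-service:1.0     # illustrative image name
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sleep", "120"]
```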
We have seen that Kubernetes deploys can cycle Pods very fast, whether or not the rest of your infrastructure can keep up. When making any changes to your cluster, it is recommended to run a Kubernetes cluster validator such as Kalc. Kalc reasons against the current state of your cluster and intelligently flags conflicts and potential issues with your intended changes. So when you make a change to your cluster, which in this example is introducing a service mesh, Kalc will estimate the cost of that change, eliminating manual guesswork.