Intro

Kubernetes has become the go-to technology for taking containers to the next level. If you've been running Docker containers for a while and your website or application suddenly hits the big leagues and starts driving serious traffic your way, you need a way to scale up fast.

So how do you go from managing a few servers to hundreds or thousands of them? It's certainly more than you can keep track of in your head when you need to scale out your business. At its core, Kubernetes gives you a way to deploy your containers, an easy way to scale them, and monitoring to keep an eye on them.

That's the exact business challenge Tinder Engineering was facing when scaling became critical and they needed a way to schedule new instances and serve traffic within seconds. For those who are not familiar with it, Tinder is the world's most popular dating app, with more than 26 million matches per day.

The Setup

Tinder runs hundreds of microservices on AWS EC2 instances behind ELBs, and because the entire infrastructure runs Kubernetes on AWS, Amazon CloudWatch provides the infrastructure-level metrics.

Tinder uses Grafana as their central monitoring platform for their online and offline workloads. They use it to monitor the health of all microservices running in containers and VMs.

They also use kube-aws for provisioning. Initially they had a single node pool, but they quickly split it into multiple pools of different instance sizes and types.

They found that running fewer heavily threaded pods (e.g. Java) together yielded more predictable performance than colocating them with single-threaded workloads (e.g. Node.js).

Eventually, they settled on a combination of:

  • c5.4xlarge instances for the Kubernetes masters (3 nodes)
  • c5.4xlarge instances for their etcd cluster (3 nodes)
  • c5.4xlarge instances for their single-threaded workloads, e.g. Node.js
  • c5.2xlarge instances for their multi-threaded workloads in Java and Go
  • m5.xlarge instances for their memory-intensive applications, e.g. monitoring
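
With pools like these, workloads are steered onto the right instance type with node labels and nodeSelector or affinity rules. As a small illustration (not taken from Tinder's write-up), the sketch below uses the Kubernetes Python client and the well-known node.kubernetes.io/instance-type label to show how a cluster's nodes split across pools:

```python
# Illustrative only -- not Tinder's tooling. Counts cluster nodes by the
# well-known instance-type label; older clusters expose it as
# "beta.kubernetes.io/instance-type" instead.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in a pod
nodes = client.CoreV1Api().list_node().items

pools = Counter(
    (n.metadata.labels or {}).get("node.kubernetes.io/instance-type", "unknown")
    for n in nodes
)
for instance_type, count in sorted(pools.items()):
    print(f"{instance_type}: {count} node(s)")
```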

They also use Flannel as the network fabric for service-to-service communication.

The Event

On January 8, 2019, Tinder was down for several hours following a scale-up event that left their cluster larger than it had ever been. A single large cluster runs into scaling challenges that a multi-cluster architecture sidesteps, and during this period Tinder experienced a series of outages that brought down their service.

The Root Cause

The series of outages came down to two root causes:

  1. The Address Resolution Protocol (ARP) table on the hosts ran out of available entries because the cluster had grown so large. With very large networks it is possible to exhaust the host's ARP cache, and once the pod and node count crossed that threshold, packets were dropped and entire Flannel /24 subnets went missing from the ARP tables. (Flannel, as noted above, is their network fabric for service-to-service communication.)
  2. The other challenge they had to work out was DNS timeouts caused by conntrack insertion failures during SNAT and DNAT. The engineering team kept seeing elevated error rates, e.g. 'Could not connect to service' and 'Could not resolve DNS for a particular service endpoint'.

The issue was amplified by ndots defaulting to 5, which turns every lookup of an unqualified name into several successive queries. Scaling CoreDNS out and tuning ndots helped, but DNS traffic still peaked at 250,000 requests per second, consuming 120 CPU cores spread across 1,000 CoreDNS pods.
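
To see why ndots:5 is so expensive, consider how a pod's resolver expands an external hostname. The sketch below is illustrative only; the search domains are generic Kubernetes defaults and the hostname is an example, not anything from Tinder's environment:

```python
# Illustrative sketch: why "options ndots:5" multiplies DNS queries.
# Kubernetes writes pods' resolv.conf with ndots:5 and several search domains,
# so any name with fewer than 5 dots is tried against each search domain
# before being queried as-is.
SEARCH_DOMAINS = [            # typical defaults for a pod in the "default" namespace
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]
NDOTS = 5

def lookups(name: str) -> list[str]:
    """Candidate names a resolver tries, in order, until one succeeds."""
    if name.endswith(".") or name.count(".") >= NDOTS:
        return [name.rstrip(".")]                        # absolute name, single try
    return [f"{name}.{d}" for d in SEARCH_DOMAINS] + [name]

print(lookups("api.example-partner.com"))
# -> 4 candidate names; each is usually queried for both A and AAAA records,
#    so one call to an external host can fan out into ~8 DNS queries.
```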

The Fix

  1. ARP Cache Exhaustion
    They expanded the ARP table by raising the kernel's neighbour-table garbage-collection thresholds (gc_thresh1, gc_thresh2 and gc_thresh3) via sysctl on every node, and then restarted Flannel on all the nodes. A minimal sketch of that change follows after this list.
  2. DNS Timeouts
    It became clear that no matter how many Kubernetes DNS pods they threw at the problem, it wasn't getting any better.
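
Here is the minimal sketch referenced in the first item. The threshold values are placeholders for illustration; the write-up doesn't say which numbers Tinder actually chose, and sizing should follow your own node and pod counts. The script needs root and is equivalent to running sysctl -w net.ipv4.neigh.default.gc_threshN=... on the node:

```python
# Sketch of the ARP-cache mitigation: raise the kernel's neighbour-table
# (ARP) garbage-collection thresholds on a node. Values below are assumed,
# not Tinder's actual numbers.
from pathlib import Path

GC_THRESH = {
    "gc_thresh1": 80_000,   # entries below this count are never garbage-collected
    "gc_thresh2": 90_000,   # soft limit: above this, entries older than 5s may be pruned
    "gc_thresh3": 100_000,  # hard limit on non-permanent neighbour entries
}

def raise_arp_gc_thresholds(values: dict[str, int]) -> None:
    """Write each threshold to /proc, like `sysctl -w net.ipv4.neigh.default.<name>=<value>`."""
    for name, value in values.items():
        Path(f"/proc/sys/net/ipv4/neigh/default/{name}").write_text(f"{value}\n")

if __name__ == "__main__":
    raise_arp_gc_thresholds(GC_THRESH)  # requires root on the node
```

To survive reboots, the same keys would also go into a file under /etc/sysctl.d/ on every node.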

As for the DNS timeouts, after researching the problem they decided to take the conntrack race condition between SNAT and DNAT out of the equation altogether.

They redeployed CoreDNS as a DaemonSet and injected each node's IP into the pods' resolv.conf, so that a container's DNS lookups hit the CoreDNS instance on its own node instead of going through the kube-dns Service, which is what triggered the SNAT and DNAT translation in the first place.
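
One way to wire that up is sketched below with the Kubernetes Python client. This is an illustration under assumptions, not Tinder's actual manifest: the coredns-node name, the image tag, and the omitted Corefile ConfigMap/volume are placeholders. The DaemonSet runs CoreDNS on the host network so it listens on each node's own IP:

```python
# Sketch: run CoreDNS as a host-network DaemonSet so every node answers DNS
# on its own IP, letting pods skip the kube-dns Service (and the DNAT/SNAT
# conntrack path). Names, image tag and the omitted Corefile volume are assumptions.
from kubernetes import client, config

def coredns_daemonset() -> client.V1DaemonSet:
    container = client.V1Container(
        name="coredns",
        image="coredns/coredns:1.6.9",              # assumed version
        args=["-conf", "/etc/coredns/Corefile"],    # Corefile ConfigMap/volume omitted for brevity
        ports=[
            client.V1ContainerPort(container_port=53, protocol="UDP"),
            client.V1ContainerPort(container_port=53, protocol="TCP"),
        ],
    )
    pod_spec = client.V1PodSpec(
        host_network=True,      # bind directly to the node's IP
        dns_policy="Default",   # CoreDNS itself resolves via the node, not via kube-dns
        containers=[container],
    )
    return client.V1DaemonSet(
        api_version="apps/v1",
        kind="DaemonSet",
        metadata=client.V1ObjectMeta(name="coredns-node", namespace="kube-system"),
        spec=client.V1DaemonSetSpec(
            selector=client.V1LabelSelector(match_labels={"k8s-app": "coredns-node"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"k8s-app": "coredns-node"}),
                spec=pod_spec,
            ),
        ),
    )

if __name__ == "__main__":
    config.load_kube_config()
    client.AppsV1Api().create_namespaced_daemon_set("kube-system", coredns_daemonset())
```

Pods can then be pointed at the node-local instance, for example by setting the kubelet's --cluster-dns flag (or the clusterDNS field in the kubelet configuration) to the node's own IP, so their resolv.conf lists the local CoreDNS rather than the kube-dns ClusterIP.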

Conclusion

Scaling Kubernetes to hundreds or thousands of nodes brings new challenges, so you need to prepare for these scale-up events in advance with disruptive tests. Using Kalc, we can simulate thousands of Kubernetes nodes running actual workloads, and our tests are designed to reveal how Kubernetes behaves while managing a complex application at large scale.