When we talk to customers about containerizing and modernizing their applications, we always ask why they want to use Kubernetes. Most of the time Kubernetes does end up being the answer, but we want to emphasize that Kubernetes is not a golden hammer.
In this article, I cover the number one way Kubernetes clusters on Amazon EKS fail, and that problem is almost always related to networking.
Kubernetes doesn't provide its own networking stack; you bring your own network provider. EKS uses the Amazon VPC CNI by default, which gives your Pods IP addresses directly from your VPC.
What is a CNI?
The Container Network Interface (CNI) is a specification and a set of libraries for writing plugins that configure network interfaces in Linux containers. Kubernetes on AWS uses amazon-vpc-cni by default, which gives your Pods and your worker nodes IP addresses from a subnet within your VPC.
If you have ever used kOps or another installer, you might be familiar with Calico, Flannel, or Canal. These are overlay networking technologies: they add a little compute overhead on the worker nodes, and you also have to manage a second layer of IP addressing that you avoid with amazon-vpc-cni.
If you look at how the CNI runs in EKS, it is built on the Elastic Network Interface (ENI), the virtual network card that provides networking capabilities to an EC2 instance. Each worker node has one or more ENIs attached, and amazon-vpc-cni runs on top of them: it uses the ENIs' IP addresses to assign VPC IPs to both the Pods and the EC2 instances.
Pods receive an IP address directly from the subnet, which means the total number of Pods in your cluster is bounded by the available IP space. The first limit is the size of the subnet: if the subnet only has 100 free IPs, you can only create 100 Pods. Any Pod created after that will remain in Pending state until IP addresses are freed up or more are made available.
So pay attention and plan for growth. Before you create a cluster, estimate the upper limit on your Pod count, and then size the network accordingly.
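As a rough capacity-planning sketch, you can work backwards from an expected Pod ceiling to the smallest subnet prefix that fits it. The target Pod count and headroom factor below are illustrative assumptions, not figures from this article:

```python
import math

def smallest_prefix(required_ips: int) -> int:
    """Smallest IPv4 CIDR prefix length whose address space covers the
    required number of IPs. AWS reserves 5 addresses in every subnet,
    so those are accounted for on top of the requirement."""
    needed = required_ips + 5  # AWS reserves 5 IPs per subnet
    bits = math.ceil(math.log2(needed))
    return 32 - bits

# Illustrative target: 2,000 Pods plus 20% headroom
target_pods = 2000
print(smallest_prefix(int(target_pods * 1.2)))  # → 20, i.e. a /20 (4,096 addresses)
```

The point is simply that the subnet prefix is a hard ceiling you choose up front, so it should be derived from your Pod growth estimate rather than left at a default.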
Let's reproduce this issue
EKSCTL Configuration File
# eks-cluster.yaml (reconstructed sketch; the instance type and node count
# match the cluster described later in this article)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: debug-k8s
nodeGroups:
  - name: nodegroup
    instanceType: m5.xlarge
    desiredCapacity: 4
cloudWatch:
  clusterLogging:
    # enable specific types of cluster control plane logs
    # all supported types: "api", "audit", "authenticator", "controllerManager", "scheduler"
    # supported special values: "*" and "all"
    enableTypes: ["audit", "authenticator", "controllerManager"]
Create EKS Cluster
$ eksctl create cluster -f resources/manifests/eks-cluster.yaml
Create a Deployment
$ kubectl create -f resources/manifests/helloWorld-deployment.yaml
Since we are simulating a high-transaction application, I need to make sure it runs at scale. So I will scale the Deployment to 240 replicas.
$ kubectl scale --replicas=240 deployment helloWorld
Now let's list our Kubernetes Deployments:
$ kubectl get deployments
When I got the Deployments, I was happy to see that 222 replicas were available, but the rest remained in Pending state.
The good news is that Kubernetes works, but I was expecting effectively unlimited scaling: I should be able to scale to millions of Pods, and that didn't happen. I only got 222 Pods, so what went wrong?
The Root Cause
I listed the Pods, and it turned out that some of them were stuck in Pending state.
Get Pending Pod
$ kubectl get pods --field-selector=status.phase==Pending
I see a bunch of them are Pending.
I used a field selector, which comes in handy in large clusters, to filter down to just the Pods in Pending state.
I want to troubleshoot a little further with kubectl get events, which lists the events in the Kubernetes cluster.
$ kubectl get events --field-selector type=Warning
11m  Warning  FailedCreatePodSandBox  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "2f43374edb1fdc76f04bab8fe4" network for pod: NetworkPlugin cni failed to set up network: add cmd: failed to assign an IP address to container
Among the events for my helloWorld application, I see the interesting one: the CNI failed to assign an IP address to the container.
This is the most common issue that people run into. The AWS EKS documentation describes a nifty component called ipamD (IP Address Management Daemon), which is responsible for IP address allocation.
It does two things. First, it maintains a warm pool of available IP addresses from the VPC, caching, say, 10 to 15 addresses. Second, as Pods are created it assigns each one an IP address from that pool, and as Pods are destroyed it reclaims their IP addresses back into the warm pool.
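The warm-pool behavior described above can be sketched as a tiny model. This is only an illustration of the idea (a cache of free IPs that Pods borrow and return), not ipamD's actual implementation:

```python
class WarmPool:
    """Toy model of ipamD's warm pool: keep a cache of free VPC IPs,
    hand one out when a Pod is created, reclaim it when the Pod dies."""

    def __init__(self, subnet_ips):
        self.free = list(subnet_ips)   # IPs cached from the VPC subnet
        self.assigned = {}             # pod name -> IP

    def assign(self, pod: str) -> str:
        if not self.free:
            # mirrors the event we saw: no IP left to hand out
            raise RuntimeError("failed to assign an IP address to container")
        self.assigned[pod] = self.free.pop()
        return self.assigned[pod]

    def release(self, pod: str) -> None:
        # returning the IP to the warm pool makes it reusable immediately
        self.free.append(self.assigned.pop(pod))

pool = WarmPool([f"10.0.0.{i}" for i in range(10, 15)])
ip = pool.assign("helloWorld-abc")
pool.release("helloWorld-abc")
```

The key consequence is the failure mode in this article: once the pool (and the subnet behind it) is empty, assignment fails and the Pod sandbox cannot be created.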
As mentioned before, each ENI reserves one IP address for itself. There is a simple formula for the maximum number of Pod IPs an EC2 instance can provide: take the number of ENIs, multiply by the number of IP addresses each ENI supports minus one (the one the ENI reserves for communication), and cap the result at the number of free IPs in the subnet.
Maximum number of Pod IPs = min(number of ENIs × (IP addresses per ENI − 1), free IPs in the subnet)
In my cluster I used the m5.xlarge EC2 instance type. It supports up to 4 ENIs, and each ENI supports 15 IP addresses.
For an m5.xlarge instance, the maximum number of Pod IP addresses per host is 4 × (15 − 1) = 56.
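Plugging the m5.xlarge numbers from above into the formula makes the arithmetic explicit:

```python
def max_pod_ips(enis: int, ips_per_eni: int, free_subnet_ips: int) -> int:
    """Maximum Pod IPs on one instance: each ENI reserves one IP for
    itself, and the free IPs in the subnet are a hard cap."""
    return min(enis * (ips_per_eni - 1), free_subnet_ips)

per_node = max_pod_ips(enis=4, ips_per_eni=15, free_subnet_ips=8192)
print(per_node)      # → 56 Pod IPs per m5.xlarge node
print(per_node * 4)  # → 224 across the 4-node cluster in this article
```

Note how the min() captures both failure modes: a small instance type limits you per node, while a small subnet limits you cluster-wide no matter how large the instances are.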
The default cluster that I created sits in a /19 subnet, which supports up to 8,192 IPs, so the subnet is not the bottleneck here.
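The /19 figure is just powers-of-two arithmetic, which can be checked quickly:

```python
def subnet_size(prefix: int) -> int:
    """Total addresses in an IPv4 CIDR block with the given prefix length."""
    return 2 ** (32 - prefix)

print(subnet_size(19))  # → 8192
```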
So the maximum number of Pods I can run on each host is 56; it is capped by the maximum Pod IPs per host.
Kubernetes v1.16 recommends running no more than 110 Pods per node, so the per-node IP limit of 56 is the binding constraint here anyway. I have 4 m5.xlarge instances, so 56 × 4 = 224, and that's why my Deployment stalled at 222: a few of those IP addresses were already consumed by system Pods.
The number one thing to keep in mind: always plan ahead and make sure you have enough IP addresses. If you don't, your application's scaling will at some point come to a halt, and while the cluster keeps running, your apps won't scale.
To help with this, AWS created the CNI Metrics Helper, which tracks how many IP addresses have been assigned and how many are available.
It's a very easy tool to deploy in your cluster, and it reports back metrics such as:
- The number of IP addresses currently available
- Any allocation errors in the cluster
- The maximum number of ENIs the cluster can support
- The number of IP addresses currently assigned to Pods
The cni-metrics-helper runs as a Pod in the kube-system namespace. It collects this information and publishes it to CloudWatch.
Create CNI Metrics Helper Policy
$ aws iam create-policy \
    --policy-name CNIMetricsHelperPolicy \
    --description "Grants permission to write metrics to CloudWatch" \
    --policy-document file://cni-metrics-policy.json  # a policy JSON allowing cloudwatch:PutMetricData (filename illustrative)
Attach policy to the worker nodes IAM role:
$ ROLE_NAME=$(aws iam list-roles \
    --query 'Roles[?contains(RoleName,`debug-k8s-nodegroup`)].RoleName' --output text)
$ aws iam attach-role-policy \
    --role-name $ROLE_NAME \
    --policy-arn arn:aws:iam::<account-id>:policy/CNIMetricsHelperPolicy
Deploy CNI Metrics Helper
$ kubectl apply -f cni-metrics-helper.yaml  # manifest from the amazon-vpc-cni-k8s repository
$ kubectl get deployment cni-metrics-helper -n kube-system
NAME READY UP-TO-DATE AVAILABLE AGE
cni-metrics-helper 1/1 1 1 60s
Using CloudWatch, you can visualize all of this information.
We've seen that EKS uses the Amazon VPC CNI by default to assign each Kubernetes Pod a private IP address from your VPC. A node's max-pods setting is tied to the maximum number of IPs available for Pods on that instance type, so if you run out of free IPs in your subnet (or on your nodes), new Pods won't be scheduled and will remain in Pending state.
Monitoring your EKS cluster with Kalc, which uses machine learning, can help you predict these kinds of problems before they actually happen. By leveraging an AI monitoring system, your organization can track a wide spectrum of cluster events in real time and be alerted when something needs attention, letting your team focus their efforts on mission-critical tasks.