Intro

In Kubernetes we have this concept of Pods. A Pod is a group of one or more containers (such as Docker containers) that are always scheduled together on the same host. A good example is when you run an application server in a Pod together with monitoring and logging sidecar containers. Kubernetes describes the Pod as the smallest deployable unit that can be created and managed. Pods are ephemeral by default, which simply means they are not meant to survive node failures, scheduling failures, or other evictions.

In this post I want us to look at the different Pod phases and how that affects your application on Kubernetes.

The Setup

Let's look at the Pod Lifecycle and the 5 Pod Phases. These are:

  • Pending
  • Running
  • Succeeded
  • Failed
  • Unknown

The only happy phases are Running and Succeeded - those mean that your Kubernetes Pods are doing well. Even in Running, though, the READY column shows how many containers in the Pod are ready out of how many it has, e.g. 1/1, 2/2, or 1/2. You have to pay close attention to those numbers, because 1/2 means that one container in that Pod is not ready.

You want to make sure it is x/x and not x/y where x and y are different numbers.

Succeeded usually means you are running a Job and that Job has completed successfully, so that's good as well.

Unknown typically happens when the kubelet is unable to communicate with the API Server.

Failed means that the containers in the Pod have terminated and at least one of them exited with an error, for example a process that crashes with a non-zero exit code. (An image that cannot be pulled usually shows up as a Pod stuck in Pending with a reason such as ErrImagePull, rather than as Failed.)
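
Note that the STATUS column of kubectl get pods can show container-level reasons (such as ImagePullBackOff) as well as the phase itself. If you want to query the raw phase field directly, something along these lines works:

$ kubectl get pods -o custom-columns=NAME:.metadata.name,PHASE:.status.phase

$ kubectl get pods --field-selector=status.phase=Pending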

The Challenge

In this article I want us to look at the different reasons why our Pods might be in Pending State.

To reproduce this issue, I will use a hello world deployment and I want to scale it to 8 replicas.
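
The exact manifest is not included here, but based on the image and the resource requests we will inspect later, helloWorld-deployment.yaml looks roughly like this (treat the details as an assumption; real Kubernetes object names must be lowercase, the helloWorld spelling just follows this post's naming):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: helloWorld
spec:
  replicas: 8
  selector:
    matchLabels:
      app: helloWorld
  template:
    metadata:
      labels:
        app: helloWorld
    spec:
      containers:
      - name: helloWorld
        image: nginx:latest
        resources:
          requests:
            cpu: "2"
            memory: 2000Mi
          limits:
            cpu: "2"
            memory: 2000Mi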

$ kubectl create -f helloWorld-deployment.yaml

deployment.apps/helloWorld created

Now the challenge comes when the deployment shows that only 4 replicas are available.

$ kubectl get deployments

NAME         READY   UP-TO-DATE   AVAILABLE   AGE
helloWorld   4/8     8            4           23s

The Event

When I list all the Pods with kubectl get pods, it shows that 4 of them are Pending.

$ kubectl get pods
. . .
NAME                          READY   STATUS    RESTARTS   AGE
helloWorld-6d4fbd5d76-9xqxg   1/1     Running   0          5s
helloWorld-6d4fbd5d76-brv7k   0/1     Pending   0          5s
helloWorld-6d4fbd5d76-hbf8h   0/1     Pending   0          5s
helloWorld-6d4fbd5d76-jdzlw   1/1     Running   0          5s
helloWorld-6d4fbd5d76-jqsfk   0/1     Pending   0          5s
helloWorld-6d4fbd5d76-k29gb   1/1     Running   0          5s
helloWorld-6d4fbd5d76-vjr62   0/1     Pending   0          5s
helloWorld-6d4fbd5d76-z69pp   1/1     Running   0          5s

I just created a deployment with 8 replicas and I can see that only 4 replicas are available. That's not a happy state, so let's dig into why a Pod can be Pending and what we can do to work around that.

The Root Cause

What are some of the reasons why my Pods may be Pending?

Well, there could be multiple reasons for a Pending Pod:

  1. Not enough resources in the cluster, that is, CPU, Memory, ports, etc.
  2. Not enough IP addresses.
  3. Unhealthy Nodes, where for example you think you have an 8 Node cluster while in reality only 4 of those Nodes are healthy and able to take Pods (a quick check for this is shown right after this list).
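
For the third case, a quick sanity check (a generic suggestion, not something specific to this cluster) is to list the Nodes and confirm they are all Ready and schedulable, then describe any Node that looks suspicious:

$ kubectl get nodes

$ kubectl describe node <node-name>

Any Node reporting NotReady or SchedulingDisabled will not receive new Pods.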

Let's look at one Pod that is not healthy.

$ kubectl describe pod/helloWorld-6d4fbd5d76-brv7k
. . .
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  42s (x2 over 42s)  default-scheduler  0/4 nodes are available: 4 Insufficient cpu.

It says FailedScheduling is the reason for that and it says 0/4 Nodes are available - insufficient CPU.

I can also say get me all the events.

$ kubectl get events

LAST SEEN   TYPE      REASON             KIND   MESSAGE
3m16s       Warning   FailedScheduling   Pod    0/4 nodes are available: 4 Insufficient cpu.
3m16s       Warning   FailedScheduling   Pod    0/4 nodes are available: 4 Insufficient cpu.
3m16s       Warning   FailedScheduling   Pod    0/4 nodes are available: 4 Insufficient cpu.
3m16s       Warning   FailedScheduling   Pod    0/4 nodes are available: 4 Insufficient cpu.
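
On a busier cluster that event list can get long; since events support field selectors on the reason field, you can narrow it down to just the scheduling failures with something like:

$ kubectl get events --field-selector reason=FailedScheduling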

So it appears there is not enough CPU available. Let's look at the CPU/Memory requirements of the deployment - what exactly is it asking for?

$ kubectl describe deployments/helloWorld

Containers:
  helloWorld:
    Image:      nginx:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     2
      memory:  2000Mi
    Requests:
      cpu:        2
      memory:     2000Mi
    Environment:  <none>

If I describe the deployment, I can see that it is the nginx:latest image that I'm trying to deploy.

If I look at the resources, each container requests 2 CPUs and 2000Mi of memory, with matching limits. The request is the amount the scheduler sets aside for the container on a Node, and the limit is the maximum it is allowed to use.

If I'm doing 8 replicas of the Pod, then I need a whole bunch of resources: 8 × 2 = 16 CPUs and 8 × 2000Mi ≈ 16GB of RAM just to cover the requests.

But hey, I'm running a 16 CPU cluster, so why are those CPUs not available to me? I can use this command, which shows the current CPU/Memory utilization on each Node of my cluster.

$ kubectl top nodes
NAME                                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-192-168-28-108.us-central-1.compute.internal   28m          0%     410Mi           2%
ip-192-168-48-190.us-central-1.compute.internal   33m          0%     363Mi           2%
ip-192-168-51-148.us-central-1.compute.internal   29m          0%     338Mi           2%
ip-192-168-64-166.us-central-1.compute.internal   32m          0%     395Mi           2%

Each Node is using only around 30m of CPU (well under 1%) and roughly 400Mi of memory out of the ~16GB it has available. The cluster is nearly idle, so why are all 16 CPUs not available to me?

Well, there is this concept of capacity and allocatable resources. Capacity is the total amount of CPU and memory on a Node, while allocatable is what is left over for Pods after the kubelet, the OS, and other kube-system processes have reserved their share.

Capacity Memory

$ kubectl get no -o json | jq -r '.items | sort_by(.status.capacity.memory)[]|[.metadata.name,.status.capacity.memory]| @tsv'

ip-192-168-28-108.us-central-1.compute.internal  15950552Ki
ip-192-168-48-190.us-central-1.compute.internal  15950552Ki
ip-192-168-51-148.us-central-1.compute.internal  15950552Ki
ip-192-168-64-166.us-central-1.compute.internal  15950552Ki

Allocatable Memory

$ kubectl get no -o json | jq -r '.items | sort_by(.status.allocatable.memory)[]|[.metadata.name,.status.allocatable.memory]| @tsv'

ip-192-168-28-108.us-central-1.compute.internal 15848152Ki
ip-192-168-48-190.us-central-1.compute.internal 15848152Ki
ip-192-168-51-148.us-central-1.compute.internal 15848152Ki
ip-192-168-64-166.us-central-1.compute.internal 15848152Ki

Capacity CPU

$ kubectl get no -o json | jq -r '.items | sort_by(.status.capacity.cpu)[]|[.metadata.name,.status.capacity.cpu]| @tsv'

ip-192-168-28-108.us-central-1.compute.internal  4
ip-192-168-48-190.us-central-1.compute.internal  4
ip-192-168-51-148.us-central-1.compute.internal  4
ip-192-168-64-166.us-central-1.compute.internal  4

Allocatable CPU

$ kubectl get no -o json | jq -r '.items | sort_by(.status.allocatable.cpu)[]|[.metadata.name,.status.allocatable.cpu]| @tsv'

ip-192-168-28-108.us-central-1.compute.internal  4
ip-192-168-48-190.us-central-1.compute.internal  4
ip-192-168-51-148.us-central-1.compute.internal  4
ip-192-168-64-166.us-central-1.compute.internal  4

You can see that there is a difference between capacity and allocatable memory (about 100Mi per Node is held back here), while capacity and allocatable CPU are both 4 cores.

The Fix

So, going by utilization there is plenty of spare memory and CPU. Why are the Pods not getting scheduled?

The catch is that the scheduler does not look at actual utilization at all; it compares the Pods' requests against each Node's allocatable resources. Each replica requests 2 full CPUs, each Node has only 4 allocatable CPUs, and system Pods such as kube-proxy, CoreDNS and the CNI plugin typically already request a slice of that. So each Node can only fit one more Pod asking for 2 CPUs, which is exactly why 4 replicas are Running and 4 are Pending, however idle the Nodes look. Kubernetes is telling us we don't have enough CPU, so what we need to set up is a Cluster Autoscaler.
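
You can verify this by looking at the Allocated resources section that kubectl describe nodes prints for each Node, which shows how much of the allocatable CPU and memory is already claimed by Pod requests; something along these lines works:

$ kubectl describe nodes | grep -A 8 "Allocated resources"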

The Cluster Autoscaler serves two purposes:

  1. It checks if any Pods are failing to schedule due to insufficient resources, and it then scales out the cluster for you.
  2. It recycles Nodes that are underutilized for an extended period of time. Let's say you have a 10 Node cluster and it notices that all the Nodes are running at about 20% CPU; since the Pods are stateless, it can evict them, pack them onto fewer Nodes, and scale the cluster down for better CPU utilization.

One thing to understand is that by default the Cluster Autoscaler is reactive, in the sense that it only adds Nodes once it notices that Pods are failing to get scheduled.

You can also use ASG policies to make the scaling more proactive, so that, for example, if CPU utilization goes above 50%, the Auto Scaling group automatically adds more Nodes to your cluster before Pods start piling up in Pending.
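
As a rough sketch, assuming an EKS-style cluster whose worker Nodes belong to an Auto Scaling group tagged for auto-discovery (adjust the manifest and flags for your own provider and cluster name), installing the Cluster Autoscaler can look like this:

$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

$ kubectl -n kube-system edit deployment.apps/cluster-autoscaler

In the editor you point the --node-group-auto-discovery flag at your cluster's tags, or use an explicit --nodes=<min>:<max>:<ASG-name> flag to set the minimum and maximum Node counts.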

So now, after installing the Cluster Autoscaler and setting its minimum and maximum Node counts, if we check the Pods, all the Pods are running.

$ kubectl get pods
. . .
NAME                          READY   STATUS    RESTARTS   AGE
helloWorld-6d4fbd5d76-9xqxg   1/1     Running   0          6m30s
helloWorld-6d4fbd5d76-brv7k   1/1     Running   0          6m30s
helloWorld-6d4fbd5d76-hbf8h   1/1     Running   0          6m30s
helloWorld-6d4fbd5d76-jdzlw   1/1     Running   0          6m30s
helloWorld-6d4fbd5d76-jqsfk   1/1     Running   0          6m30s
helloWorld-6d4fbd5d76-k29gb   1/1     Running   0          6m30s
helloWorld-6d4fbd5d76-vjr62   1/1     Running   0          6m30s
helloWorld-6d4fbd5d76-z69pp   1/1     Running   0          6m30s

Conclusion

In this post we looked at the Pod phase field and all the possible values it can have: Pending, Running, Succeeded, Failed and Unknown. We then dug deeper to find out the various reasons why Pods can get stuck in the Pending state. In this particular case, it was because there wasn't enough allocatable CPU in our cluster to satisfy the Pods' requests.

At Kalc, we have created a Machine Learning algorithm on CPU and Memory utilization based on data that we’ve collected from thousands of Kubernetes environments. This very sophisticated computer model helps Cluster Operators to improve cluster observability and monitoring for a fully autonomous workflow. You can level up your Kubernetes monitoring approach with our game changing tool by reaching out to our team and requesting a demo.