NRE Labs is a site for teaching network automation in the browser using real, interactive, compelling virtual environments. Its main aim is to democratize interactive, dependency-free learning.

The Labs are powered by the Antidote project, which provides a platform for representing curriculum-as-code. The platform is a JavaScript Web application that uses elements of the Apache Guacamole project.

When you set out to learn network automation, the hardest part is often setting up labs in complex virtual environments. NRE Labs removes this initial barrier, letting the learner focus on the lesson itself without sacrificing any of the advantages of a dedicated interactive lab.

The Setup

NRE's infrastructure is made up of multiple layers, each provisioned and managed by its own tooling. Lesson resources are usually provisioned as inter-networked Kubernetes pods (groups of one or more containers) and made available to each learner.


In the illustration below, you can see that the platform is able to serve the same lesson to two learners in isolation. Each lesson instance contains two resources (vqfx1 and linux1), running in its own Kubernetes namespace. A learner is then able to interact with the resources as dedicated virtual machines, using Web consoles connected via SSH to the appropriate lesson instance.
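To make the isolation concrete, here is a minimal, hypothetical sketch of how two copies of the same lesson can be kept apart with namespaces. The names and images below are illustrative only, not Antidote's actual manifests:

```yaml
# Hypothetical sketch -- not the real Antidote manifests.
# The same pod spec is applied once per learner, each in its own namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: lesson-14-learner-a       # a second learner would get lesson-14-learner-b
---
apiVersion: v1
kind: Pod
metadata:
  name: linux1
  namespace: lesson-14-learner-a  # pod names can repeat across namespaces
spec:
  containers:
  - name: linux1
    image: ubuntu:18.04           # stand-in for the real lesson image
    command: ["sleep", "infinity"]
```

Because pod names only need to be unique within a namespace, the platform can stamp out identical lesson topologies per learner without any renaming.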

Syringe contains all the operational scripts, configs, playbooks, and packs for managing the Antidote service behind NRE Labs. Antidote-web, on the other hand, is the user-facing Web front end, providing access to lesson resources through an interactive UI.

The Kubernetes-based architecture allows learners to easily spin up additional compute resources for running lessons in a Linux virtual environment.

The Challenge

For NRE, as for any cloud-based business, what matters most is what the end user can and can't do. Loss of end-user access to the application is always a catastrophic event. Let's say a learner has made it into the front-end application and can navigate to a lesson, but the lesson just hangs forever at the loading screen. Eventually they will see some kind of error message indicating the lesson timed out while trying to start.
It is always frustrating when you deploy your application on Kubernetes and see it fail without warning. Is it a problem with the pod network? Pods Pending or in CrashLoopBackOff, images not pulling, Services not serving? Maybe you just ran out of resources. Troubleshooting cluster issues can take up a lot of valuable time.
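A quick first pass in this situation is to surface any pods that aren't healthy. The filter below is a sketch: in a live cluster you would pipe `kubectl get pods -A --no-headers` into it; here, sample output stands in so the snippet runs anywhere.

```shell
# Sketch: print pods whose STATUS (field 4) isn't Running or Completed.
# The printf lines are stand-in sample data for `kubectl get pods -A --no-headers`.
printf '%s\n' \
  'prod         syringe-7d9f    1/1   Running            0    3d' \
  'prod         vqfx1-abc12     0/1   CrashLoopBackOff   12   3d' \
  'kube-system  coredns-8q7tg   1/1   Running            0    3d' |
awk '$4 != "Running" && $4 != "Completed"'
```

Only the CrashLoopBackOff line survives the filter, giving you an immediate shortlist of pods to investigate.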

The Event

Service disruption was exactly what Matt Oswalt, an SRE at NRE, woke up to. Part of his morning routine includes looking at the day’s NRE Labs stats. Here’s what he saw:

Active lessons had tanked to just 2 sessions. Matt started troubleshooting by getting the status of all the Kubernetes pods in the production namespace.

$ kubectl describe pods -n=prod
The connection to the server was refused - did you specify the right host or port?

It was evident that something was definitely amiss, as he couldn't even connect to the API server. He was, however, able to connect via SSH to the GCE instance running the Kubernetes master node, where he noticed there was no mention of the API server in the output of docker ps.

$ sudo docker ps

Matt then checked on all of the pods in kube-system - these are the services that power the cluster itself.

$ kubectl get pods -n=kube-system
NAME                                               READY     STATUS    RESTARTS   AGE
coredns-78fcdf6894-8q7tg                           1/1       Running   0          3d
coredns-78fcdf6894-wh5rc                           1/1       Running   1          3d
etcd-antidote-controller-4m3j                      1/1       Running   183        58d
kube-apiserver-antidote-controller-4m3j            1/1       Running   183        58d
kube-controller-manager-antidote-controller-4m3j   1/1       Running   91         58d

Reading the output closely, you'll notice that the etcd and API server pods have each restarted a whopping 183 times.

While still on the master node, he ran the following Docker command:

$ sudo docker ps -a

This command revealed that etcd was constantly starting and shutting down. The log output of these terminated containers showed a perfectly normal etcd startup process.

2018-12-04 23:24:34.730415 I | embed: serving client requests on
2018-12-04 23:25:56.161691 N | pkg/osutil: received terminated signal, shutting down...

The kubelet logs seemed to indicate that it was responsible for restarting the etcd pod.

Dec 04 22:00:12 antidote-controller-4m2d kubelet[45053]: I1204 22:00:12.245101   35066 kuberuntime_manager.go:513] Container {Name:etcd...truncated....} is dead, but RestartPolicy says that we should restart it.

The Root Cause

Kubernetes liveness probes are a really valuable tool for ensuring your pods are actually working. If we take a peek at the etcd definition in use at that time, we see an interesting section:

    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -ec
        - ETCDCTL_API=3 etcdctl --endpoints=https://[]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
          --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
          get foo
      failureThreshold: 8
      initialDelaySeconds: 15
      timeoutSeconds: 15

Rather than relying only on whether the process inside the pod is running, liveness probes let the kubelet actively verify that the application is working - by executing a command inside the container, for example, or by sending it an HTTP request.
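For comparison, a minimal HTTP-based liveness probe looks like this. This is a generic sketch, not taken from the NRE Labs manifests:

```yaml
# Generic example: the kubelet sends GET /healthz every 10 seconds; after
# 3 consecutive failures it restarts the container.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
```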

In this case, kubeadm had created a liveness probe that tries to read a key named foo. If the read request completes successfully, the pod is marked healthy and Kubernetes leaves it alone. Should it fail, the kubelet attempts to restore normality by restarting the pod.

Note that failureThreshold is a count, not a duration: the probe is allowed 8 consecutive failures, each with a 15-second timeout. Multiply them together (8 x 15 = 120 seconds) and you have two minutes - exactly the restart interval he was observing.

Something about this liveness probe was failing, and the kubelet was doing exactly what it should be doing - restarting the pod in order to restore normality. Even though it didn’t appear that etcd was actually broken, for some reason this liveness check was failing.

The Solution

The etcd health check was failing, yet he couldn't see any connectivity issues when hitting the etcd health endpoint directly from the master node. Since the probe appeared to be the only thing not working, he elected to simply remove the etcd liveness check. Etcd wasn't actually unhealthy; the kubelet just thought it was.

He edited the Kubernetes manifest to remove the entire liveness probe section, then waited for the next etcd restart and quickly applied the changes with kubectl apply.
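Stripping a whole livenessProbe block from a manifest can also be scripted. The sketch below removes the block (and its indented children) from a trimmed stand-in for etcd.yaml; in the real incident the edit was made by hand, and the standard kubeadm path /etc/kubernetes/manifests/etcd.yaml is an assumption here:

```shell
# Sketch: delete the livenessProbe block from a pod manifest.
# The here-doc stands in for /etc/kubernetes/manifests/etcd.yaml.
awk '
  /^    livenessProbe:/ { skip = 1; next }   # block starts here
  skip && /^    [^ ]/   { skip = 0 }         # a key at the same indent ends it
  !skip
' <<'EOF'
    image: k8s.gcr.io/etcd:3.2.24
    livenessProbe:
      exec:
        command:
        - /bin/sh
      failureThreshold: 8
      timeoutSeconds: 15
    name: etcd
EOF
```

Everything under livenessProbe is dropped, while sibling keys such as image and name pass through untouched.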

Even though this action fixed the outage, it is not an elegant solution: it takes time and a lot of trial and error to find the smoking gun. What if there were a tool that could help you predict and fix Kubernetes issues?

kubectl-val is a tool that uses AI to predict risks before they affect your production environment. It validates your Kubernetes cluster's configuration using AI planning.

In a failure scenario, kubectl-val leverages automated planning and scheduling strategies to deliver action sequences as YAML-encoded steps. Possible actions might include stopping a deployment, scheduling more Kubernetes pods, and so on.
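The source doesn't show kubectl-val's output format, but a YAML-encoded action sequence of the kind described might look roughly like this. Every field name below is invented for illustration; this is not kubectl-val's actual schema:

```yaml
# Entirely hypothetical illustration of a YAML-encoded remediation plan.
plan:
  - step: pause-deployment        # invented action names
    target: deployments/syringe
  - step: schedule-pods
    target: deployments/antidote-web
    replicas: 3
```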


In the NRE Labs incident, it was established that the liveness probe was causing the terminations. Changing the probe configuration broke the termination cycle and restored the cluster. However, it's still not clear why the liveness check was failing in the first place.

As we have seen, you can use kubectl-val by Kalc to predict risks before they hit your cluster. It is a fully autonomous tool that runs config validations and anticipates failure scenarios before they even affect your Kubernetes cluster. Kubectl-val evaluates the possible events that could lead to an outage by analyzing your cluster state against a knowledge base of config scenarios.