Intro

AWS and the other leading public cloud providers have made infrastructure as a service, and its subscription-based pricing model, very attractive. As more and more companies buy into the benefits of this model, they are taking the next step: migrating their applications from data centers full of VMs to cloud-native platforms built around Kubernetes.

I was having a discussion with a customer about container-based workloads and the challenges they initially faced after their end-to-end migration from EC2 to Kubernetes. They shared a very common scenario we see play out in organizations making the transition from a legacy application to a cloud-native architecture.

The Setup

  • They had a Java service that they needed to package as a Docker container and deploy onto Kubernetes.
  • They provisioned their cluster with Kops, a tool that makes it easy to stand up a production-grade Kubernetes cluster.
  • They used EC2 instances running the Amazon Linux AMI for the data plane.
  • They used Amazon RDS as their main MySQL database.
  • They used an ELB for traffic load balancing.

The Challenge

Before Kubernetes, it was a simpler time and place: one service ran on one instance. Their instances were big, with 128 GB of RAM, and the JVM assumed the full 128 GB was available to the service.

In Kubernetes you have multiple services co-located on the same node, each in its own Pod, yet the JVM in every Pod still sees all 128 GB. Suddenly you are out of memory and your Pods get OOM-killed. The same thing happens with CPU: each Pod thinks it owns every core, and they end up starved of CPU.
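
To make this concrete, here is a minimal sketch (the class name is illustrative) of what an older, pre-container-aware JVM reports from inside a Pod:

    // Minimal sketch: on a pre-8u191 JVM the default max heap is derived from
    // the node's physical RAM (128 GB here), not from the Pod's memory limit,
    // so several co-located Pods can collectively overcommit the node.
    public class HeapReport {
        public static void main(String[] args) {
            long maxHeapMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
            System.out.println("JVM max heap: " + maxHeapMiB + " MiB");
        }
    }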

The Event

Soon after completing the migration of the service, they noticed that some endpoints seemed to have slightly higher latency than they had on the EC2 VMs. After double-checking that the configurations were the same between EC2 and Kubernetes, they moved forward with the rollout.

Over the course of the next couple of weeks it became clear that there was a real regression in the latency of those endpoints, and the likely culprit was increased latency to their downstream storage proxy (dbproxy).

Their DB connections were seeing P95 latencies of 30 ms to 100 ms and P99 latencies of 100 ms to 200 ms.

When they actually dug into it, they found that for one specific endpoint, the code was creating a new thread pool for every request that connected to the database.
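
Their exact code isn't shown here, but the anti-pattern looks roughly like this hypothetical handler, which builds and tears down an ExecutorService on every single request:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Hypothetical handler illustrating the anti-pattern: a fresh thread pool
    // is created, used once, and torn down for every incoming request.
    public class DbQueryHandler {
        public String handle(String query) throws Exception {
            ExecutorService perRequestPool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors()); // sized from perceived CPUs
            try {
                Future<String> result = perRequestPool.submit(() -> runQuery(query));
                return result.get();
            } finally {
                perRequestPool.shutdown(); // the pool lives for only this one request
            }
        }

        private String runQuery(String query) {
            // placeholder for the real call through dbproxy
            return "result";
        }
    }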

The Root Cause

Why did this work before Kubernetes?

Let’s take a look at how Java services behave. If you run a Java application on a node all by itself, the JVM reports the node’s resources to the application: “I have 36 CPUs.”

Now put it in a single Pod on that node and it still works.

But run three of these Pods on the same Kubernetes node and each of them will think it has 36 CPUs, and that is the problem.
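
A one-liner makes the point: run this in each of the three Pods on a 36-core node and, on an older JVM, every one of them prints 36 (the class name is illustrative):

    // Minimal sketch: a pre-container-aware JVM reports the node's core count,
    // regardless of the Pod's CPU request or limit.
    public class CpuReport {
        public static void main(String[] args) {
            System.out.println("JVM sees "
                    + Runtime.getRuntime().availableProcessors() + " CPUs");
        }
    }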

This was a long-standing open bug in the JDK:

https://bugs.openjdk.java.net/browse/JDK-8146115

The problem is that older versions of Java are not “container aware.” This matters because the JVM sizes its internals based on how many resources (such as CPU cores and memory) it thinks the system has, which in turn affects how it manages things like thread pools and garbage collector threads.
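
That perceived CPU count feeds straight into standard library defaults. As a rough sketch of the cascade (the pools below are illustrative, not their code):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ForkJoinPool;

    // Sketch: defaults that cascade from the perceived CPU count. On a
    // pre-container-aware JVM these come from the node's cores, not the
    // Pod's CPU limit.
    public class DerivedDefaults {
        public static void main(String[] args) {
            int cpus = Runtime.getRuntime().availableProcessors();

            // The common pool (used by parallel streams, CompletableFuture
            // defaults, etc.) sizes its parallelism from availableProcessors().
            System.out.println("Common pool parallelism: "
                    + ForkJoinPool.commonPool().getParallelism());

            // Hand-rolled pools are frequently sized the same way.
            ExecutorService pool = Executors.newFixedThreadPool(cpus);
            System.out.println("Fixed pool sized for " + cpus + " CPUs");
            pool.shutdown();
        }
    }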

The Fix

This bug was fixed in Java 8u191+ (the fix landed in JDK 10 and was backported). Simply upgrading the JDK means resources are allocated based on the container rather than the node. Tracking the endpoint’s p95 latency confirmed it: after the upgrade, latency went back down, and that worked pretty well.
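
Container support is on by default in those versions, and a few standard HotSpot flags let you pin or tune what the JVM detects (the jar name and values below are illustrative):

    # Container awareness is the default in 8u191+ (-XX:+UseContainerSupport).
    # ActiveProcessorCount and MaxRAMPercentage override what the JVM derives
    # from the container's cgroup limits.
    java -XX:+UseContainerSupport \
         -XX:ActiveProcessorCount=2 \
         -XX:MaxRAMPercentage=75.0 \
         -jar service.jar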

Now the JVM sees the correct amount of CPU for your Pod, but the service turned out to have a concurrency bug in its asynchronous request handling, one that gets worse with a smaller thread pool (fewer CPUs).

In other words, shrinking down to the correct resources (CPU) can expose applications to race conditions, which here led to another outage. They fixed the performance issue by reusing a single thread pool held in a static context.
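
The shape of that fix is a single shared pool, created once and reused across requests, along the lines of this sketch (names are illustrative):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch of the fix: one shared, statically held pool reused by every
    // request, sized from the (now correct) container CPU count.
    public class DbQueryHandler {
        private static final ExecutorService SHARED_POOL =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        public String handle(String query) throws Exception {
            Future<String> result = SHARED_POOL.submit(() -> runQuery(query));
            return result.get();
        }

        private String runQuery(String query) {
            // placeholder for the real call through dbproxy
            return "result";
        }
    }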

So when upgrading the JDK, check for correct thread pool usage, blocking calls made from asynchronous threads, and other concurrency bugs in your multithreaded programs. As an application owner, use a canary cluster and test your application before rolling it out to production.

This is not specific to Java services. It is well known by now that older JVMs are not aware of cgroups, but similar pitfalls occur in other language runtimes, frameworks, and sidecars. Envoy is another example: by default it sets its concurrency to the number of CPUs on the underlying host, which also causes contention on the host, so they had to explicitly configure a lower concurrency (Envoy’s --concurrency option).

Conclusion

The key takeaway here is to beware of languages, frameworks, and sidecars that are not aware of the container abstraction. As we saw here, the container promise of “build once, run anywhere” isn’t 100% accurate: languages and applications can have deeper dependencies on the underlying systems they run on. You should therefore upgrade your runtimes to be container aware.

You can also calculate the impact of your next feature using Kalc, a cluster validator that searches for any configuration that may bring about failure in your cluster. The goal of the project is to create an intent-driven, self-healing Kubernetes configuration system that abstracts the cluster manager away from error-prone manual tweaking. Kalc leverages AI to tackle the Kubernetes complexity nightmare.