Intro

As we all know, running a distributed system can be a messy business. Even as more and more organizations move toward microservices architectures on Kubernetes, debugging these systems remains a major problem. A good platform gives us visibility into what's deployed, which endpoints are being called, and full distributed traces. The problem is that Kubernetes doesn't offer any of that by default.

As the number of microservices within an organization grows to hundreds or thousands, the increased complexity of inter-service communication can become daunting. This is where a service mesh comes in, and in this article we will go over how to find and fix bugs in your service mesh.

The Setup

What is Linkerd?

Linkerd is a service mesh. A service mesh is an infrastructure layer responsible for service-to-service communication. It sits alongside your application in Kubernetes and is responsible for the reliable and secure delivery of requests.

So let's say you have your application, and it's running in some Kubernetes pods.

When you install Linkerd, what you're doing is adding a series of proxies that run as sidecar containers alongside your application.

Traffic is transparently proxied through Linkerd without your application being aware that this is happening.

Traffic flow in a meshed application
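
In practice, meshing a workload means injecting the sidecar proxy into its manifest with the Linkerd CLI. A minimal sketch, where my-app.yml and the my-apps namespace are hypothetical placeholders rather than anything from the Paybase setup:

# Mark the workload for proxy injection and apply it to the cluster
linkerd inject my-app.yml | kubectl apply -f -

# Each meshed pod should now report two ready containers: the app and linkerd-proxy
kubectl get pods -n my-apps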

Because a service mesh sits next to your application and sees all the incoming and outgoing traffic, it can provide you with a lot of useful features (an example follows the list), like:

  • Automatic / zero-config mTLS between your meshed services
  • Telemetry and Monitoring
  • Distributed Tracing
  • HTTP, HTTP/2, and gRPC Proxying
  • Latency-aware load balancing
  • Retries and Timeouts
  • TCP Proxying and Protocol Detection
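
For instance, the telemetry is exposed directly through the Linkerd CLI. A minimal sketch, assuming a pre-2.10 Linkerd release (newer releases move this under the viz extension) and the hypothetical my-apps namespace from above:

# Success rate, request rate, and latency percentiles for every deployment
linkerd stat deploy -n my-apps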

The Challenge

Paybase is an API-driven payment services provider. Their customers are B2B: marketplaces, gig/sharing economies, crypto businesses, or any sort of business with payment flows too sophisticated for traditional financial institutions to handle.

This means that Paybase operates in a highly regulated industry: they hold a banking license, so they care deeply about availability, monitoring, and knowing what's happening inside their cluster.

Why did they choose Linkerd?

- The main thing that Linkerd gives them is gRPC load balancing, and therefore scalability. 100% of their platform runs in Kubernetes, with 50+ microservices talking to each other via gRPC.

- Kubernetes currently does not provide gRPC load balancing by default: gRPC rides on long-lived HTTP/2 connections, and Kubernetes balances traffic at the connection level, so every request on a connection lands on the same pod. This means that if they require scalability, they can't get it without a custom solution - and that's where Linkerd comes in, since its proxies balance individual requests rather than connections.

The Event

When they started rolling Linkerd out into their cluster, it was a bumpy ride: their application is complex, and maintaining a service mesh is not easy either.

What they expected vs What they got

On the left side of this picture, you can see the payload of a request that their application was expecting, and in the highlighted square, you can see how some of the headers should look. On the right side, you can see what happened to that request when it passed through the Linkerd proxy. The headers were clearly mutated, and that's when they started getting some very obscure 502 protocol errors.

The Root Cause

Diagnosing the Bug

Where could the bug be?

Is it...

  • in the application?
  • in the application's dependencies?
  • in the Linkerd control plane? (Golang)
  • in the Golang dependencies?
  • in the Linkerd-proxy? (Rust)
  • in the Rust dependencies?
  • in Kubernetes?
  • in the Linux Kernel?

It was quite a journey to figure out what the issue was. They did their best to come up with a framework for narrowing it down.

Troubleshooting Service Mesh Errors Flow Chart
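
A natural first step in any such framework - standard Linkerd tooling rather than anything specific to this incident - is to rule out a broken installation before digging into the traffic itself:

# Verify that the cluster and the Linkerd control plane are healthy
linkerd check

# Record the exact CLI and control-plane versions for the bug report
linkerd version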

Paybase filed a bug report against Linkerd.

* The initial bug report contained linkerd-proxy logs as well as their application logs

kubectl logs -f deploy/foo -n bar

  • Application logs, plus the proxy logs from the linkerd-proxy sidecar container (see the sketch after this list)
  • There were protocol errors on requests that had gRPC metadata in the headers. However, that wasn't enough to tell the Linkerd team where this was happening
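
To pull the proxy's logs specifically, rather than the application container's, you would typically target the sidecar container by name. A minimal sketch, reusing the placeholder deployment foo in namespace bar from above and assuming the default sidecar container name linkerd-proxy:

# -c selects the linkerd-proxy container inside the deployment's pods
kubectl logs -f deploy/foo -n bar -c linkerd-proxy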

* They asked for further details: linkerd tap

  • They examined the requests between services in the application using Linkerd tap. Linkerd tap is a diagnostic tool that lets you tap into a particular resource and watch its live requests - basically, to see what the proxy sees at that moment (a sketch follows)
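
Tapping the same placeholder deployment would look roughly like this; each output line is one request/response as the proxy saw it, with method, path, status, and latency (deploy/baz is a hypothetical downstream service):

# Stream live requests flowing through the proxies of deployment foo
linkerd tap deploy/foo -n bar

# Narrow the view down to traffic headed for one particular service
linkerd tap deploy/foo -n bar --to deploy/baz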

* They dove even deeper: tcpdump (see the sketch after this list)

  • Looked at the raw TCP packets
  • Saw that headers were being split across two frames
  • This is unusual, as headers typically only take up one frame
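
There are several ways to capture raw packets from inside the cluster. A minimal sketch, assuming the pod has a container with tcpdump installed (the container name app is a placeholder) and a kubectl version that accepts the deploy/ shorthand for exec; 4143 is the linkerd-proxy's inbound port, so this captures meshed inbound traffic:

# Write meshed inbound traffic to a pcap file for offline analysis in Wireshark
kubectl exec -it deploy/foo -n bar -c app -- tcpdump -i any -s 0 -w /tmp/linkerd.pcap port 4143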

The Fix

To understand how to fix this bug, you need to know:

  • How Linkerd uses HTTP/2 in the service mesh
  • How HTTP/2 works

HTTP/2 in the Linkerd service mesh

When Service A wants to talk to Service B, the traffic doesn't go directly; instead, it goes through the proxies running alongside Service A and Service B. This proxy-to-proxy communication uses HTTP/2.

Another thing to take into account is multiplexing. It's a cool HTTP/2 feature where multiple requests are sent over the same TCP connection at the same time.

How does it do this?

  • Each request or response is a message that gets broken down into units called frames; frames from different messages are interleaved on the connection, and that's how they get sent.
  • Each frame has a stream ID that tells you which request or response it belongs to, and a frame type that tells you what kind of data the frame is carrying

The HEADERS frame, as you might have guessed, contains headers. If there are too many headers to fit in that frame, you can send another frame type called a CONTINUATION frame, which carries the rest of the headers.
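
You can watch these frames on the wire yourself. A small sketch using nghttp from the nghttp2 project (a general-purpose HTTP/2 client, not part of Linkerd) against any HTTP/2-capable endpoint; the verbose output prints each frame with its type and stream ID:

# -v prints every frame (SETTINGS, HEADERS, DATA, ...) along with its stream ID
# -n discards the response body so only the frame trace is shown
nghttp -nv https://nghttp2.org/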

Back to the Paybase bug. Note the repeated headers?

There are a lot of headers, so they suspected a CONTINUATION frame was in play; some of the headers are also repeated.

Bug #1: Continuation Frame Panic

- The code panicked when a CONTINUATION frame contained a repeated header.

Bug #2: Evicted Table Header Index

- They were looking up repeated headers using the wrong index. (HTTP/2 compresses headers with HPACK, which keeps recently sent headers in a dynamic table and refers to repeats by index; once an entry has been evicted from that table, a stale index points at the wrong header.)

So these bugs were deep in the stack - they were in the HTTP/2 Rust library that Linkerd uses. After finding the bugs, the Linkerd team fixed them and the user was happy.

In the future, Linkerd hopes to improve its diagnostics by adding the features below (a sketch of the debug sidecar follows the list):

  • A debug Kubernetes sidecar container that can be deployed into a failing pod to diagnose problems using tshark, tcpdump, lsof, and iproute2
  • More visibility into application traffic with Linkerd tap so that it can view request bodies
  • Tracing and visibility in the Rust libraries they depend on
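
For reference, later Linkerd releases did ship a debug container along these lines. A minimal sketch, assuming a release whose inject command supports the debug-sidecar flag and reusing the placeholders from earlier:

# Inject both the proxy and the linkerd-debug container
# (which bundles tshark, tcpdump, lsof, and iproute2) into the workload
linkerd inject --enable-debug-sidecar my-app.yml | kubectl apply -f -

# Then open a shell in the debug container of one of the workload's pods
kubectl exec -it deploy/foo -n bar -c linkerd-debug -- /bin/sh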

Conclusion

There was more than one bug! The bugs were deep in the stack. All got fixed fairly quickly due to:

  • Detailed bug reports
  • Room to test the application with and without Linkerd, and across different versions
  • Paybase following the log suggestions in the issue template

If you’re looking for AI implementations of this debug functionality, or if you're struggling to wrap your head around the complexity of testing your nodes, pods, and services, take a good look at kalc.io.