Zalando is an e-commerce store that sells lifestyle and fashion products to customers in seventeen European markets. It is considered the starting point for fashion in Europe and currently offers more than 300,000 products from 2,000 different fashion and lifestyle brands. Zalando uses Kubernetes on AWS to serve its store to customers. On 7 January 2019, the entire Zalando Fashion Store website returned an elevated number of errors to customers for over one hour. In this post, we’ll discuss this incident, which was caused by an outage of DNS in the Kubernetes cluster, and how Zalando’s engineers recovered from it.
Zalando’s Fashion Store was running with the default Kubernetes DNS configuration, which is affected by the “ndots 5 problem.” As a result, each lookup of a cluster-external name by the application produced 10 DNS queries. The application service was a Node.js application, and the default DNS resolution in Node.js does not use any DNS caching, so it generated more DNS lookups than other applications.
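To make the figure of 10 queries concrete, here is a minimal Python sketch of how a glibc-style resolver expands a name under ndots:5. The search list below is an assumption (three cluster domains plus one cloud-provider domain, all with hypothetical names), since the post does not show Zalando’s actual resolv.conf. The sketch also assumes one A and one AAAA query per candidate name, and that every search-domain candidate fails before the absolute name is tried.

```python
from typing import List, Sequence

# Hypothetical search list for a pod in the "default" namespace.
# These domains are illustrative assumptions, not Zalando's real config.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "eu-central-1.compute.internal",
]


def candidate_names(name: str, search: Sequence[str], ndots: int = 5) -> List[str]:
    """Return the names a glibc-style resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, then the name as given.
    return expanded + [absolute]


def query_count(
    name: str,
    search: Sequence[str],
    ndots: int = 5,
    record_types: Sequence[str] = ("A", "AAAA"),
) -> int:
    """Worst-case number of DNS queries for one lookup of `name`."""
    return len(candidate_names(name, search, ndots)) * len(record_types)


if __name__ == "__main__":
    for candidate in candidate_names("api.example.com", SEARCH):
        print(candidate)
    print(query_count("api.example.com", SEARCH))
```

Under these assumptions, an external name such as api.example.com has fewer than five dots, so four search-domain candidates are tried before the real name: five names times two record types gives 10 queries. Writing the name with a trailing dot (api.example.com.) skips the search list and reduces this to two.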
On Monday, 7 January 2019, the entire Zalando website showed an elevated number of errors to customers for one hour. All of the errors came from the main aggregation layer service, which ran in the Kubernetes cluster, and were caused by an outage of DNS in that cluster.
The incident started with the calls from the aggregation layer to its downstream services, after which the aggregation layer returned 404 errors. Clients then retried again and again, which created a spike in requests to the aggregation layer.
The request spike in turn caused a spike in DNS queries to CoreDNS.
The higher query load drove up the memory usage of the CoreDNS pods until they all ran out of memory (OOM) and were killed at the same time.
The CoreDNS pods could not recover from the initial OOM kill: they kept running out of memory and being killed, which led to a total outage of DNS in the Kubernetes cluster.
Note: Graphs were taken at a later point (not from the actual incident) when CoreDNS scaled to handle the spikes.
The Root Cause
Because of the DNS outage, the aggregation layer could not resolve hostnames, as there was no DNS caching available, and it opened its circuit breakers to the downstream services.
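To illustrate what application-level DNS caching could look like, here is a minimal Python sketch of a resolver wrapper with a fixed TTL. It is not Zalando’s fix, and the service in the incident was a Node.js application. The class name, the 30-second TTL, and the choice to serve a stale entry when resolution fails are all assumptions made for illustration.

```python
import socket
import time
from typing import Callable, Dict, List, Optional, Tuple


def system_resolve(host: str) -> List[str]:
    """Resolve a hostname through the operating system resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


class CachingResolver:
    """Hypothetical TTL cache in front of a DNS resolve function."""

    def __init__(
        self,
        ttl_seconds: float = 30.0,  # assumed value, tune per service
        resolve: Optional[Callable[[str], List[str]]] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._resolve = resolve or system_resolve
        self._clock = clock
        # host -> (expiry time, addresses)
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def lookup(self, host: str) -> List[str]:
        now = self._clock()
        entry = self._cache.get(host)
        if entry is not None and entry[0] > now:
            # Fresh entry: answer without sending a DNS query.
            return entry[1]
        try:
            addresses = self._resolve(host)
        except OSError:
            # socket.gaierror is a subclass of OSError. If DNS is down,
            # fall back to the last known answer instead of failing.
            if entry is not None:
                return entry[1]
            raise
        self._cache[host] = (now + self._ttl, addresses)
        return addresses
```

With a cache like this, repeated lookups of the same downstream hostname within the TTL send no DNS queries, and a short DNS outage does not immediately make already-known hosts unresolvable. The trade-off is that address changes are picked up only after the TTL expires.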
Due to this, Zalando’s internal monitoring system for the Kubernetes cluster also went down completely and could not communicate with the external services it used for pushing metrics and triggering alerts. It therefore took longer than expected to notify the Kubernetes engineer. Once informed, the engineer identified the OOM kills of the CoreDNS pods and manually raised their memory requests/limits from 100Mi to 2000Mi.
Once the Kubernetes engineer increased the resources, the CoreDNS pods recovered and could handle the workload from the Kubernetes cluster again, and Zalando’s Fashion Store was available to customers once more.
Zalando gained a better understanding of DNS-related issues while improving the DNS configuration in their Kubernetes environment (see also the related Kubernetes issue and the ndots 5 problem). They were already acquainted with such issues, but had not fully understood how severely a DNS outage could impact their Kubernetes deployment. Besides this, they realized that their monitoring system was not resilient enough. They had configured some alerting mechanisms outside their clusters, but most of the alerting mechanisms ran inside the Kubernetes cluster, which delayed calling the on-call Kubernetes engineer during the DNS outage.
They are introducing a more resilient DNS infrastructure that runs CoreDNS together with dnsmasq on every node of the Kubernetes cluster. They also learned that they should strengthen their alerting from outside the clusters to improve incident response.
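One common way to alert from outside a cluster is a heartbeat check, sometimes called a dead man’s switch: the cluster sends a regular signal to an external watcher, and the watcher alerts when the signal stops. The Python sketch below shows only the timing logic. It is an illustration under assumptions, not Zalando’s setup; the class name and the 60-second threshold in the usage note are hypothetical.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Hypothetical external watcher that alerts when heartbeats stop."""

    def __init__(
        self,
        max_silence_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_silence = max_silence_seconds
        self._clock = clock
        # Treat construction time as the first heartbeat.
        self._last_seen = clock()

    def record_heartbeat(self) -> None:
        """Call this whenever a heartbeat arrives from the cluster."""
        self._last_seen = self._clock()

    def should_alert(self) -> bool:
        """True when the cluster has been silent for too long."""
        return self._clock() - self._last_seen > self._max_silence
```

A watcher built as HeartbeatWatchdog(60.0) would report should_alert() as True once more than 60 seconds pass without a call to record_heartbeat(). Because the watcher runs outside the cluster, it can still page the on-call engineer when in-cluster DNS and monitoring are down.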
Another way to prevent such issues is to use the Kalc Kubernetes config validator. With Kalc, you can minimize your Kubernetes issues by running autonomous checks and config validations and by predicting other risks before they affect the production environment. The AI-first Kubernetes Guard navigates through a model of possible actions and events that could lead to an outage, modeling the outcomes of changes and applying its growing knowledge of config scenarios.