Intro

In today's cloud world, you are no longer worried only about machine failure; you also have to account for data center failure. UPS system failure, cybercrime, human error, natural disasters, and faulty generators rank as the top five culprits of data center outages. In November 2019, AWS, Microsoft Azure, and Google Cloud all experienced outages in the same week. All systems are vulnerable to failure, but an outage in a data center can take down one customer's service while another customer's service keeps running, because the latter invested more in redundancy. In this post, we will look at how an AWS S3 outage took down the Weave Cloud service.

The Setup

Kubernetes is complex. Ideally, when rolling out major or minor changes, you'd like to make a pull request and then just go to a URL to see your app change. Weaveworks solves this problem by providing a simple and consistent way to manage containers and microservices through technologies that support cloud-native development. They encourage app developers and DevOps teams to push code, not containers, using automated continuous delivery pipelines, observability, and monitoring tools.

Weave Cloud has been running on Kubernetes since 2015. It is a service for deploying, exploring, and monitoring microservice-based applications. Weaveworks uses declarative infrastructure tools such as Kubernetes, Docker, and Terraform, and everything, including code, config, monitoring rules, and dashboards, is described in GitHub with a full audit trail.

The Challenge

When Weaveworks launched Weave Cloud, they designed it to serve static UI assets from an Nginx service hosted on a Kubernetes cluster. The assets were baked into the container image by their CI pipeline, which meant they could perform rolling upgrades and rollbacks of the UI simply by rolling the Nginx service itself.

As a result of this design, rolling upgrades of the Nginx service could cause page load errors, because the UI assets are named after their content hash. During a rolling upgrade, index.html was often served by a container running the latest version, while requests for the assets it referenced were routed to a container running the older version, which had no files with those hashed names, so the assets failed to load.
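To make the failure mode concrete, here is a minimal sketch of content-hash asset naming; the file names and paths are hypothetical, not Weave Cloud's actual build output. Because every release produces differently named asset files, an index.html from one release can only be satisfied by a container built from that same release.

```python
import hashlib
from pathlib import Path

def fingerprint(asset_path: Path) -> Path:
    """Rename a built asset after the hash of its content, e.g. app.js -> app.3f2a9c1b.js."""
    digest = hashlib.sha256(asset_path.read_bytes()).hexdigest()[:8]
    hashed = asset_path.with_name(f"{asset_path.stem}.{digest}{asset_path.suffix}")
    return asset_path.rename(hashed)

# Each release produces a different file name, and index.html references that
# exact name. If index.html comes from a new container but the asset request
# lands on an old one, the old container has no file by that name, so the
# browser gets a 404 and the page fails to load properly.
bundle = fingerprint(Path("build/app.js"))          # hypothetical build output
index = Path("build/index.html")
index.write_text(index.read_text().replace("app.js", bundle.name))
```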

They wanted rolling upgrades without any potential for page load errors. To solve this, they uploaded the assets to AWS S3 as part of their CI pipeline. The index.html could then reference that bucket, while Nginx continued to serve the pages.
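A minimal sketch of what such a CI publishing step might look like with boto3; the bucket name, build directory, and cache policy below are assumptions for illustration, not Weave Cloud's actual configuration.

```python
# Hypothetical CI step: publish content-hashed assets to S3 after the UI build.
from pathlib import Path

import boto3

BUCKET = "example-ui-assets"   # hypothetical bucket
BUILD_DIR = Path("build")      # hypothetical output directory of the UI build

s3 = boto3.client("s3")
for asset in BUILD_DIR.glob("*.*.js"):   # content-hashed bundles, e.g. app.3f2a9c1b.js
    s3.upload_file(
        str(asset),
        BUCKET,
        f"assets/{asset.name}",
        # A hashed name never changes content, so it can be cached aggressively.
        ExtraArgs={"CacheControl": "public, max-age=31536000, immutable"},
    )
```

Because hashed file names never change once published, old and new asset versions can coexist in the bucket, so an index.html served at any point during a rolling upgrade can find the files it references.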

The Event

On 28 February 2017, Weave Cloud suffered an outage that lasted 4 hours and 23 minutes. For more than an hour, users could not load https://cloud.weave.works at all; they just got a blank page. The blank UI was a direct consequence of the UI assets (CSS, JS, images) being stored on S3.

Prometheus alerted Weave Cloud engineering that their dependencies hosted on S3 had stopped working. They had recently introduced Prometheus monitoring into their client-side JavaScript code, and this automated alerting detected the issue in around five minutes.
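Their instrumentation lives in client-side JavaScript, but the shape of such alerting can be sketched in Python: a tiny collector that the frontend reports failed asset loads to, exposed as a Prometheus counter for an alerting rule to fire on. The endpoint path, metric name, and ports below are hypothetical illustrations, not Weave Cloud's implementation.

```python
# Hypothetical collector: the browser POSTs asset-load failures here, and
# Prometheus scrapes the counter so an alerting rule can fire on a spike.
from http.server import BaseHTTPRequestHandler, HTTPServer

from prometheus_client import Counter, start_http_server

ASSET_LOAD_FAILURES = Counter(
    "ui_asset_load_failures_total",               # hypothetical metric name
    "Static assets the browser failed to load",
)

class ReportHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == "/report/asset-failure":  # hypothetical endpoint
            ASSET_LOAD_FAILURES.inc()
            self.send_response(204)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    start_http_server(8000)                       # exposes /metrics for Prometheus
    HTTPServer(("", 8080), ReportHandler).serve_forever()
```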

The Root Cause

The root cause was a total outage of AWS S3 in the us-east-1 region, where Weave Cloud's static assets were hosted. S3 was down for more than four hours, leaving Weave Cloud inaccessible for part of that time and degraded for the rest.

The S3 outage itself was caused by operator error. An S3 team member following an established playbook executed a command intended to remove a handful of servers used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly, and a much larger set of servers was removed than intended.

The Fix

The outage presented multiple challenges because CircleCI, their CI service, was also affected by the S3 outage. Several engineers tried to build a non-S3 version of the Kubernetes Nginx UI server, but Docker Hub, where the Nginx base image is hosted, was also experiencing issues, so base images could not be pulled.

Luckily, one of their engineers had the requisite base images cached locally and could rebuild the UI image without Docker Hub. They then wrote a script that pushed the new Nginx image to all production nodes over SSH using docker save/load, which restored service for the Weave Cloud UI.
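The actual script is not public, but the docker save/load approach over SSH can be sketched roughly as below; the image tag and node names are placeholders.

```python
# Rough sketch of distributing an image over SSH with docker save/load,
# in the spirit of the workaround described above.
import shlex
import subprocess

IMAGE = "weave-ui:hotfix"                    # hypothetical image tag
NODES = ["prod-node-1", "prod-node-2"]       # hypothetical production hosts

for node in NODES:
    # Stream the image tarball from the local Docker daemon straight into
    # `docker load` on the remote node, bypassing any registry.
    cmd = f"docker save {shlex.quote(IMAGE)} | ssh {shlex.quote(node)} docker load"
    subprocess.run(cmd, shell=True, check=True)
```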

This outcome suggests that customers must adopt a more sophisticated approach to maximize uptime, and it underscores why companies should deploy workloads across multi-region and multi-cloud infrastructure. Weave Cloud is now considering spreading its bets across multiple public clouds by hosting a second region on a different cloud provider, such as GCP. They hope this will minimize the impact of a zone or region outage.

Conclusion

Cloud providers have aggressively pursued region and zone expansions to help with DR and HA. The onus is now on customers to map out a strategy that takes advantage of the expanded footprint. Kalc offers another strategy for managing DR and HA by simulating worst-case scenarios against your clusters to test their ability to respond. It is a more intelligent alternative to chaos engineering, where you intentionally break things in an effort to learn how to build more resilient Kubernetes clusters.