Apache Cassandra is a free, open-source, highly scalable NoSQL database that achieves great performance in multi-node setups with no single point of failure. Cassandra supports replication across multiple data centers, offering lower latency for users and the ability to survive regional outages. The largest known production deployment is Apple's, with over 75,000 nodes storing more than 10 PB of data. This is the story of how an SRE team broke a 200+ Node Cassandra cluster.
Prior to the incident, they had deployed a Cassandra cluster; everything came up fine and worked perfectly. They selected 210 instances with similar CPU and memory specifications and deployed them using a single Auto Scaling Group (ASG) spanning 3 Availability Zones. However, when they came back the next morning, the cluster was completely broken, and nobody had done anything.
They dug into the details and found that 25% of the Pods were stuck in Pending with a volume affinity issue, which was quite unusual.
They went deeper, and it turned out that 25% of the Nodes had been deleted. Since they were using local volumes for the cluster, these Pods were not schedulable: their persistent volumes were pinned to Nodes, and zones, that no longer existed.
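The scheduling constraint at play can be sketched in a few lines. This is a toy model, not Kubernetes code: the Pod names and zone names are hypothetical, and the point is simply that a Pod bound to a zonal volume is only schedulable while at least one Node still exists in that volume's zone.

```python
def schedulable_pods(pod_volume_zones, node_zones):
    """Return the Pods whose bound volume's zone still has a Node.

    pod_volume_zones: {pod_name: zone of its bound persistent volume}
    node_zones: set of zones that still contain at least one Node
    """
    return {pod for pod, zone in pod_volume_zones.items() if zone in node_zones}


# Hypothetical Pods, one per zone:
pods = {
    "cassandra-0": "eu-west-1a",
    "cassandra-1": "eu-west-1b",
    "cassandra-2": "eu-west-1c",
}

# All zones healthy: every Pod can be scheduled.
print(schedulable_pods(pods, {"eu-west-1a", "eu-west-1b", "eu-west-1c"}))

# Every Node in eu-west-1c deleted: cassandra-2 stays Pending.
print(schedulable_pods(pods, {"eu-west-1a", "eu-west-1b"}))
```

No amount of retrying helps here: the scheduler is behaving correctly, the volume's zone simply has no Nodes left.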
Losing 25% of the Nodes (~50 Nodes) is a big deal. Their setup is cloud native, so naturally it ought to be resilient to Node failures. They were fairly confident that their cloud provider, AWS, had deleted the Nodes during the night.
The Root Cause
What had happened is that they had deployed the Cassandra cluster on a single Auto Scaling Group (ASG) spanning 3 Availability Zones.
Auto Scaling lets you take advantage of the safety and reliability of geographic redundancy by spreading ASGs across multiple Availability Zones within a region. When one AZ becomes unavailable, the ASG launches new instances in an unaffected Availability Zone. When the unhealthy AZ returns to a healthy state, the ASG automatically redistributes instances evenly across all of the configured Availability Zones.
When they did the deployment, one of the AZs had capacity issues and only 30 Nodes were available there, so AWS created the extra Nodes in the other AZs, where there were no capacity issues (which was fine). But during the night, capacity came back in the constrained zone, and the ASG (which is actually quite clever) kicked in and rebalanced, redistributing instances evenly across all AZs (70 per zone). It did so by terminating Nodes in the over-provisioned zones and launching replacements in the recovered one. Because EBS volumes are zonal, the affected Pods could not follow: the replacement Nodes came up in an AZ where their EBS volumes did not exist.
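The rebalancing step can be modeled with a small simulation. This is not AWS code; the even 90/90 split of the remaining instances across the two healthy zones is an assumption for illustration, as is the simple "terminate down to the per-zone target, launch up to it" rule.

```python
def rebalance(instances_per_az, desired_total):
    """Terminate in over-provisioned zones and launch in
    under-provisioned ones until every zone holds an equal
    share of the desired total (toy model of AZ rebalancing)."""
    target = desired_total // len(instances_per_az)
    terminated = {az: max(0, n - target) for az, n in instances_per_az.items()}
    launched = {az: max(0, target - n) for az, n in instances_per_az.items()}
    return terminated, launched


# Zone "a" only had capacity for 30 Nodes at deploy time; the exact
# split of the remaining 180 across "b" and "c" is assumed here.
terminated, launched = rebalance({"a": 30, "b": 90, "c": 90}, 210)
print(terminated)  # {'a': 0, 'b': 20, 'c': 20}
print(launched)    # {'a': 40, 'b': 0, 'c': 0}
```

Every terminated Node in zones "b" and "c" strands the EBS-backed volumes of the Pods that ran on it, while the fresh Nodes in zone "a" are useless to those Pods.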
As of today, there are two workarounds to prevent this outage. First, they now use a separate ASG for each Availability Zone, which is a lot safer. In the worst case, if a zone lacks capacity, at least they know that not all of the Nodes will be provisioned, meaning the cluster simply won't be created.
This is good because they can just wait for AWS to sort things out and avoid this kind of surprise. It also increases the chance of keeping the desired capacity if some instance pools are reclaimed when EC2 needs the capacity back.
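With one ASG per AZ there is nothing to rebalance: each group owns a fixed share of the desired capacity and is pinned to its own zone's subnets. A minimal sketch of how those per-zone targets could be derived (the function name and even split are illustrative assumptions, not part of any AWS API):

```python
def per_az_desired_capacity(total, azs):
    """Split the total desired capacity across one ASG per AZ.
    Any remainder goes to the first zones so the sum is preserved."""
    base, remainder = divmod(total, len(azs))
    return {az: base + (1 if i < remainder else 0) for i, az in enumerate(azs)}


targets = per_az_desired_capacity(210, ["eu-west-1a", "eu-west-1b", "eu-west-1c"])
print(targets)  # {'eu-west-1a': 70, 'eu-west-1b': 70, 'eu-west-1c': 70}
```

Each resulting number would become the desired capacity of a single-AZ ASG; since no group spans more than one zone, cross-zone rebalancing can never delete a Node out from under its volumes.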
Second, they are also using the kubectl-val cluster config validator, which simulates a Kubernetes environment and checks whether a failure condition is reachable from your current cluster setup. The tool automates Kubernetes validation using AI, giving you results comparable to having a professional Kubernetes SRE examine your cluster state.
I hope you found this post useful and that it helps you use ASGs safely in your Kubernetes clusters to achieve high availability. As mentioned, you can use the Kalc Kubernetes cluster validator to get predicted-failure reports that help you understand whether a scaling operation will work. You will instantly know if a proposed change risks evicting Pods, breaking labels, or starving resources.