Kubernetes adoption is exploding because it is a great platform for running your applications. Kubernetes itself is a stateful platform, so there is data associated with it. As with all data processing applications, you need to provide some data protection capabilities. In this post I want to go over a case where a user lost data. The case was quite surprising: after a kubelet restart, they lost the data on their volumes.
Kubernetes has the concept of volumes. A volume is a directory that the containers in a pod can access. Persistent volumes are designed to survive machine outages by being backed by storage such as NFS, iSCSI, GCE Persistent Disk, or AWS EBS. This lets your storage outlive your pod: if the pod or the node itself goes down, your data is preserved.
A pod spec typically has a volume mounts section (volumeMounts) that lists the mounts each container wants. It also references a Persistent Volume Claim, and the pod accesses storage through this claim.
The YAML for this looks like the following (only the mount entry is shown):
- mountPath: /cache
Containers are not designed to be persistent. You have probably also heard that your infrastructure should be stateless and immutable, and that approach has some really nice properties. The problem is that there is no such thing as a stateless architecture. If you want your application to do something useful, you need to store data somewhere.
This is why enterprises that have already bought into container orchestration are demanding stateful Kubernetes deployments. It is also why Kubernetes introduced the volume plugin system, which lets Kubernetes workloads use remote storage systems to persist data. However, using the persistent storage feature can pose challenges, as one enterprise client found out the hard way.
Data on PV wiped after kubelet restart
In this case the kubelet went offline because of a crash, but the cause doesn't really matter: it could just as well go offline for a kubelet update.
While the kubelet was offline, something deleted a Pod in the API server. The newly restarted kubelet then wiped out all the data in the volume of the Pod that had just been deleted.
The Root Cause
kubelet uses caches to track which volumes are mounted, where they are mounted, and on which pods.
However, these caches were not restored when the kubelet restarted: the pod was no longer present in the API server, so the kubelet had no source of information about where the volume was mounted.
At the same time, the kubelet has a regular routine that garbage collects the data of deleted pods. In this case, the deleted pod's directory was picked up for garbage collection while its volumes were still mounted; they had not been unmounted yet.
So the kubelet recursively removed everything in the directory being garbage collected. Because a volume was still mounted there, it removed all the data from the volume as well.
The Fix
First they looked at all the places where kubelet recursively removes data with os.RemoveAll. To make sure this never happens again, they added a special check: when recursively deleting data, kubelet must not cross a filesystem boundary, so that it cannot accidentally enter a mounted volume and purge its data.
This was just a safety check. The real fix was to introduce reconstruction. The kubelet keeps a directory on disk that records which volume was mounted, where it is mounted, and how to bring the volume up. With reconstruction, when a new kubelet starts and the pod is not in the API server, the kubelet can restore its caches from that directory. Using the caches, the kubelet knows how to match up volumes and pods.
The bottom line is that you can lose all your data in a matter of seconds. Storage is critical for any organization, and you cannot afford to get it wrong. To guard against this kind of data loss, you can run disruptive tests using Kalc. We have a wide range of metrics for testing different failure scenarios, including kubelet restarts. By running this tool in your cluster, you can get a clear idea of the possible failure events with minimal operator overhead.