Today's distributed systems need to be resilient. Resilient, in short, is a way that ideally a user does not notice at all if a random failure takes place or that the user at least can continue to use the degraded application. On Monday 9 July 2018, a RavenDB Kubernetes cluster had a total meltdown and the database stopped responding to requests. Restarting the node would result in everything working for about 10 minutes and then the same issue would recur.
This RavenDB Cluster had two RavenDB server instances (Cluster Nodes). In RavenDB, the database is replicated to multiple nodes and this clustering provides redundancy and increased availability of data that is consistent across a fault-tolerant, HA cluster. Data is synchronized across the nodes with a master to master replication. Even if the majority of the cluster is down, data consistency is guaranteed. As long as a single node is available, the RavenDB cluster can still process Reads and Writes.
This issue would generally impact a single node, eventually causing it to stop processing requests. After the first node failure, the second node would take over the workload for a few more minutes, then terminate in the same manner.
It was evident that the incident was directly related to the load on the master nodes. Why is that?
- The system had been running for about a month on the same configuration, with 10% - 30% RAM consumption and low CPU usage within that period.
- The last upgrade to the RavenDB servers had occurred a week prior.
- There were no new traffic patterns that could have caused that load.
- The operations team swore that there had been no recent change made to either server.
This incident was even more puzzling because the nodes didn’t have high CPU, high memory or paging which are telltale signs that the system is under too much load. Furthermore, I/O rates were quite good. Testing the disk speed and throughput showed that everything was all right.
The engineers tried to revert all the recent changes, to both clients and servers. They monitored the servers to see what happens when the problem recurred and eventually they found the smoking gun. The affected node would gradually increase the number of threads being used. Slowly racking more than 700 threads. The non-impacted node would stay put at about 100, until the first node would fail, in which case all load would hit the second node, and it was able to handle the extra load, for a while, then the thread count would start accumulating, the request times would increase, and eventually things would come to a stop.
The Root Cause
An interesting pattern about the problem was that RavenDB would still try to respond to requests but then fail with this error:
The server is too busy, could not acquire transactional access
This error was an indicator that they were hitting an internal limit in the database. By default, RavenDB has a limit of 512 concurrent requests inside the node. This has been provisioned to mitigate against swamping a node with too many requests that could lead to service disruption. Increasing this to a higher value did stabilize things but still, requests remained slower than normal.
Since they couldn't handle requests as fast as they were coming in, the thread pool which dispatches requests started to increase the number of threads. However, the thread pool was unable to catch up with all the requests coming in and at some point, they hit the concurrent requests limit and the system began rejecting requests.
They moved to investigate the I/O processes where they found that 95% of the threads were stuck in different I/O related methods e.g. NtCreateFile, NtEditFile, NtDeleteFile, and NtCloseFile. The problem was no doubt I/O related although according to the performance monitor, I/O rates were excellent.
At that stage, the issue was in the system since Memory and CPU utilization was moderate. More often than not, this type of issue is caused by any tool that hooks into system calls. Anti-virus software can cause this behavior. As it happened, anti-virus software was installed on both VMs. Anti-virus software tends to look into all types of system calls and is often a culprit in such incidents.
The operations team was instructed to disable the anti-virus and reboot the RavenDB VMs. After one hour the system was holding steady with no increase in the number of threads. In fact, there was a huge drop in CPU usage from 30% - 50% to single digits 5% - 9%.
The anti-virus was looking into the I/O RavenDB was generating. As it did so, it added overhead for each I/O call made by the database. For a database, additional overhead on I/O will slow down reads and writes causing requests to queue up until it hits a limit where it starts rejecting the requests.
Kalc offers a faster, less disruptive alternative compared to this solution. Kalc replicates the current cluster environment in AI and provides autonomous checks and config validations without chaos in your test or in the production environment.
Kalc’s AI-first solution will help protect your cluster from hitting resource limits. It will also protect against multithreading on nodes with a history of consuming lots of RAM. We have a full model of RAM/CPU resource consumption behaviors and it covers all possible variations.
In summary, be very aware of what is running in your system. This outage could have been avoided by running the kubectl-val cluster config validator that can scan the file system filters installed in the system. kubectl-val will not only prevent a recurrence for such an issue but it will also improve the time to resolution. kubectl-val could have alerted about the anti-virus issue much faster thereby preventing service disruption.