In the early days, there was no virtualization; everyone ran on bare-metal physical machines. One of the reasons Tencent wanted virtualization was resource utilization: average CPU utilization in its data centers was only 10%. Tencent wanted a system that could offer elastic resource utilization, meaning elastic CPU, I/O, disk, memory, and network.
That sounds a lot like Kubernetes, a software-defined container management system with high performance and scalability that enables the operation of an elastic web server framework. Since 2014, when Kubernetes v0.1 was released, Kubernetes has quickly risen to become the de facto standard for running containerized workloads. Building on Docker technology, Kubernetes makes it much easier to manage large-scale container clusters through features such as deployment, resource scheduling, dynamic scaling, and service discovery.
Be that as it may, one lingering inhibitor to faster adoption of Kubernetes is that it is fairly complex to learn and manage at scale. In this post, we will look at how Tencent has been able to overcome some of these challenges.
Tencent Kubernetes Engine (TKE): A unified platform for public cloud and internal use. Tencent Kubernetes Engine is 100% compatible with the native Kubernetes APIs. TKE solves operating environment issues during development, testing, and operations, and helps reduce costs and improve efficiency.
Benefits of TKE:
- Elastic resource management
- Ceph-based distributed file system
- A high-performance network fabric
- Highly available (HA) Docker registry
- Hundreds of additional system/cluster/node metrics
- No single point of failure
- Advanced features:
  - Native CI/CD with Jenkins
  - GPU virtualization
  - Advanced canary deployment
Gaming is huge at Tencent. It is the world's largest games publisher, so games contribute a major share of Tencent's earnings. Many of the most popular games, such as Honor of Kings and League of Legends, are either built by Tencent or made under license. Gaming poses some interesting architectural challenges for Kubernetes: it is a very different kind of workload, one that doesn't fit the mold of the workload resources that Kubernetes offers today.
Over the years, Tencent has steadily acquired many gaming companies. As a result, it has hundreds of thousands of games, and many of those services tend to be large monoliths that require a lot of resources to deploy independently.
Something that is really important in gaming is to have static IPs for many of these services. If a Kubernetes Pod crashes, its replacement is assigned a different IP by default (IPs are dynamic), which is not the desired outcome. The IP address assigned to a Pod is only valid for the lifespan of that Pod.
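To make the requirement concrete, here is a minimal sketch of the behavior a game service would want: an address pool that remembers which workload owns which IP, so that a restarted instance gets its old address back. This is an illustration only, not Tencent's implementation; the class name `StickyIPAllocator`, the owner keys, and the subnet are all hypothetical.

```python
import ipaddress


class StickyIPAllocator:
    """Hands out IPs from a subnet and remembers which workload owns which IP.

    Hypothetical illustration; not part of Kubernetes or TKE.
    """

    def __init__(self, cidr):
        network = ipaddress.ip_network(cidr)
        # hosts() skips the network and broadcast addresses.
        self._free = [str(ip) for ip in network.hosts()]
        self._by_owner = {}

    def allocate(self, owner):
        # A restarted instance with the same identity gets its old IP back.
        if owner in self._by_owner:
            return self._by_owner[owner]
        if not self._free:
            raise RuntimeError("address pool exhausted")
        ip = self._free.pop(0)
        self._by_owner[owner] = ip
        return ip

    def release(self, owner):
        # Meant for permanent deletion of a workload, not for a crash.
        ip = self._by_owner.pop(owner, None)
        if ip is not None:
            self._free.append(ip)


allocator = StickyIPAllocator("10.0.0.0/29")
first = allocator.allocate("game-a/lobby-0")
other = allocator.allocate("game-a/lobby-1")
after_restart = allocator.allocate("game-a/lobby-0")  # same owner, same IP
```

The key design choice is that the mapping is keyed by a stable workload identity rather than by the lifetime of any one Pod, which is exactly what the default behavior does not give you.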
The Root Cause
Many of these games were acquired through mergers and acquisitions, so they are not easy to change. Their configs can't be changed, most of them are stateful, and they require static IPs (which by default cannot be assigned to Pods due to the dynamic nature of Kubernetes' IP layer). None of this lends itself well to working with Kubernetes.
Tencent decided to extend Kubernetes with some custom plugins using custom resource controllers and deployments. One of these is Tapp, which lets them do things such as batch deployments that go beyond what the built-in StatefulSet offers, so that they can deploy much more quickly.
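As a rough illustration of why batching speeds things up: a StatefulSet rolling update replaces Pods one at a time by default, while a batch rollout touches several instances per round. The sketch below only models the round counting; the functions `plan_batches` and `rollout` are hypothetical and say nothing about how Tapp is actually implemented.

```python
def plan_batches(instance_ids, batch_size):
    """Split instance IDs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ids = list(instance_ids)
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


def rollout(instances, new_version, batch_size):
    """Update a {instance_id: version} mapping in place, batch by batch.

    Returns the number of rounds the rollout needed.
    """
    rounds = 0
    for batch in plan_batches(sorted(instances), batch_size):
        for instance_id in batch:
            # A real controller would replace the instance here and wait
            # for the whole batch to become ready before moving on.
            instances[instance_id] = new_version
        rounds += 1
    return rounds


instances = {i: "v1" for i in range(10)}
one_by_one_rounds = len(plan_batches(instances, 1))   # one instance per round
batched_rounds = rollout(instances, "v2", batch_size=4)
```

With ten instances, a one-at-a-time rollout needs ten rounds, while batches of four need three.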
Also, when a service dies, there is no automatic cleanup, because these services are stateful. Instead, the service waits for an operator to restart it, which ensures that it won't lose state.
All of these changes to the Kubernetes Engine are what allowed Tencent to reduce costs and improve efficiency when running its gaming workloads, which are critical to its revenue.
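The "wait for an operator" behavior can be pictured as a small state machine inside a control loop. The following is a minimal sketch under assumed names (`Instance`, `reconcile`, and the phase strings are all hypothetical), not the actual controller logic:

```python
RUNNING = "Running"
FAILED = "Failed"
AWAITING_OPERATOR = "AwaitingOperator"


class Instance:
    """Hypothetical view of one stateful game service instance."""

    def __init__(self, name):
        self.name = name
        self.phase = RUNNING
        self.restart_approved = False


def reconcile(instance):
    """One pass of a control loop; returns the action taken."""
    if instance.phase == FAILED:
        # Do not clean up or recreate: keep the state and ask for a human.
        instance.phase = AWAITING_OPERATOR
        return "hold"
    if instance.phase == AWAITING_OPERATOR:
        if instance.restart_approved:
            instance.phase = RUNNING
            instance.restart_approved = False
            return "restart"
        return "hold"
    return "noop"


server = Instance("world-1")
server.phase = FAILED
actions = [reconcile(server), reconcile(server)]  # stays on hold
server.restart_approved = True                     # operator signs off
actions.append(reconcile(server))
```

The point of the design is that a failure never triggers an automatic restart; only an explicit approval moves the instance back to running.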
We’ve seen that, out of the box, Kubernetes is not a silver bullet for all your problems. Luckily, Kubernetes is highly configurable and extensible, which means that you can adapt it to support new kinds of software with unique requirements by adding your own custom resources and controllers to your Kubernetes cluster.
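For readers who have not written one, a custom resource starts with a CustomResourceDefinition. The sketch below builds a minimal one as a Python dictionary and serializes it to JSON. The group `games.example.com`, the kind `GameServer`, and the two spec fields are made up for illustration; only the overall `apiextensions.k8s.io/v1` layout follows the Kubernetes API.

```python
import json


def build_crd(group, kind, plural, version="v1"):
    """Return a minimal CustomResourceDefinition manifest as a dict."""
    return {
        "apiVersion": "apiextensions.k8s.io/v1",
        "kind": "CustomResourceDefinition",
        # The CRD name must be <plural>.<group>.
        "metadata": {"name": f"{plural}.{group}"},
        "spec": {
            "group": group,
            "scope": "Namespaced",
            "names": {"kind": kind, "plural": plural, "singular": kind.lower()},
            "versions": [{
                "name": version,
                "served": True,
                "storage": True,
                "schema": {"openAPIV3Schema": {
                    "type": "object",
                    "properties": {
                        "spec": {
                            "type": "object",
                            # Hypothetical fields for a game server resource.
                            "properties": {
                                "replicas": {"type": "integer", "minimum": 0},
                                "image": {"type": "string"},
                            },
                            "required": ["image"],
                        },
                    },
                }},
            }],
        },
    }


crd = build_crd("games.example.com", "GameServer", "gameservers")
manifest = json.dumps(crd, indent=2)
```

Since `kubectl apply -f` accepts JSON as well as YAML, the resulting manifest could be written to a file and applied to a test cluster. A definition on its own only stores objects; a controller that watches them is what gives the new resource its behavior.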
One of the main challenges of working with Kubernetes is the validation of your deployment YAMLs. Kalc.io can validate both the cluster schema and state using the OpenAPI spec. Kalc has the ability to optimize your cluster in the background, gradually increasing reliability by rebalancing and reducing cost by freeing Nodes with low utilization. For inquiries, write us an email at email@example.com