How did we build a k8s operator that allows 100% up time for a high availability high workload database?
Operating a distributed high load, high throughput database in the cloud comes with several interesting challenges. In order to manage real-time serving of mission critical workloads at 100% availability we developed a Kubernetes operator that handles the operational complexities.
We needed to handle the following requirements:
- Apply live patches
- Replace live cluster with tens of nodes
- Handle degraded/crashed nodes
Under these conditions:
- High Availability - remain 100% online with no down time
- Operate under very high workloads and traffic
- Manage replicated records across different hardware failure groups (rack awareness)
Due to its stateful nature and the type of workloads that are usually handled, cluster management and recovery are non-trivial. We are using the Operators API to handle that complexity and control the clusters from within Kubernetes.
In this talk we’ll cover the steps we took to plan and execute and the challenges we faced and share the best practices.
Priority access to all content
Exclusive promotions and giveaways