Conf42 Cloud Native 2021 - Online

Building a K8s Operator for a Distributed Database

Video size:


How did we build a k8s operator that allows 100% up time for a high availability high workload database?

Operating a distributed high load, high throughput database in the cloud comes with several interesting challenges. In order to manage real-time serving of mission critical workloads at 100% availability we developed a Kubernetes operator that handles the operational complexities.

We needed to handle the following requirements: - Apply live patches - Replace live cluster with tens of nodes - Handle degraded/crashed nodes

Under these conditions: - High Availability - remain 100% online with no down time - Operate under very high workloads and traffic - Manage replicated records across different hardware failure groups (rack awareness)

Due to its stateful nature and the type of workloads that are usually handled, cluster management and recovery are non-trivial. We are using the Operators API to handle that complexity and control the clusters from within Kubernetes.

In this talk we’ll cover the steps we took to plan and execute and the challenges we faced and share the best practices.


  • Natalie Pistunovich will talk about the Kubernetes operator that we built at Aerospike for a distributed database. It's the first time that I'm speaking in a conference in such a format of a YouTube stream. If you learned anything interesting, please tag me.
  • Kubernetes operators are clients of the Kubernetes API. They act as controllers for a custom resource. The operator pattern is used for automating repeatable tasks. It is used to take the human out of the equation, because doing such things is boring.
  • The Kubernetes operator is driven by a single custom resource CR. Manages all the things lifecycle management, database cluster scale up and down, server version upgrade and downgrade. Next challenge is changes during a rolling update. How not to wipe data twice?


This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, we'd like to invite you to the talk about the Kubernetes operator that we built at Aerospike for a distributed database and join the talk you it's the first time that I'm speaking in a conference in such a format of a YouTube stream and I'm excited. If you learned anything interesting, please tag me. Please tag the conference and please use the conference hashtags conf 42 and cloud native. My name is Natalie Pistunovich. I'm a developer advocate lead at Aerospike, a Google developer expert for Go, an OpenAI developer ambassador and I organize the Berlin user groups for the go community and women tech makers. I'm also organizing the conferences go for fun Europe cloud nine ha. And besides Berlin. And you're welcome to follow me on Twitter. My handle is Natalie piss. So what's in our agenda today? We're going to talk about what are Kubernetes operators, what is Aerospike? Then we're going to see high level design of these aerospike Kubernetes operator and I will tell you some of the engineering challenges that we faced developing it. So let's start with saying what is a Kubernetes operator? It's designed for automation. Kubernetes offers out of the box automation and with it you can automate and deploy and running workloads. You can also automate how Kubernetes does that. The core of Kubernetes controlled plane is the API server. It exposes an HTTP API that lets end clusters, different parts of the cluster and external components communicate with each other. We're here today to talk about the operators. So what is the definition of an operators? Operators are clients of the Kubernetes API. They act as controllers for a custom resource. In Kubernetes, a controller is a control loop that watches the state of the cluster and it makes or request changes where needed. Each controller tries to move the current cluster state closer to the desired state. The controller tracks at least one Kubernetes resource type and as we said, operator tracks the state of a custom resource. A custom resource is an object that extends the Kubernetes API or allows you to introduce your own API into a project or a cluster. A custom resource definition CRD is a file that defines your own object kinds and it lets the API server handle the entire lifecycle. For example, aerospike is a database and it's a custom resource. Our engineers built an operator for it because it's not part of the Kubernetes ecosystem. The operator pattern is used for automating repeatable tasks and it combines custom resources and custom controllers. It is used to take the human out of the equation, because doing such things is boring. And when you do boring things, you make mistakes. One more time for the people in the back, in case you did not write your own operator or did not dive into the ones that we're using, maybe this was already a little confusing. So let's say that we start with a user that tells that it wants to do things. So it sends a command to the Kubernetes cluster. The API server exposes an HTTP API that, as we said, lets end users different parts of the cluster and the external components communicate with one another. Then Kubernetes creates pods to host the application instance. Each pod is tied to a node. A cluster is a set of worker machines called nodes that run containerized apps. Every cluster has at least one worker node. And let's say in our example we have n pods that are tied to n nodes. They all are created by a deployment object. Deployments are usually used for stateless applications like web servers. Pods that are deployed by deployment are identical and interchangeable, and they're created in a random order with random hashes in their pod names. On the other hand, stateful sets are used for stateful applications when dealing with databases. You definitely want to have a stateful set because we do want to store the data. Pods that are deployed by a stateful set component are not identical. They each have their own identity, which they keep between restarts, and each can be addressed individually. Service is an object then an abstract way to expose an application running on a set of nodes as a network service. A config map is an API object used to store non confidential data in key value pairs. Pods can consume config maps as environment variables, as command line arguments, or as configuration files in a volume. A config map allows you to decouple environment specific configurations from your container images so that your applications are easily portable. In kubernetes, controllers are control loops that watch the state of your cluster and then, as we said, make a request or just change the situation in a way that's needed in order to bring the cluster state closer to the desired stateful. And it tracks at least one resource type. And these there is custom resource. For example, our database and this one is handled by these operator. And both concepts, controller and operator, they represent patterns. They don't involve language specific implementation framework, which means that in order to write a control or an operator, you'll need to follow the convention, but you don't need to use any specific language. So to put this on writing, kubernetes operators would do the following things. Probably it minimizes the manual deploying and lifecycle management, so handles things like resource management or complex resource management, scale up or down the size of the cluster, and upgrade or downgrade the version. It manages your configuration. And of course it does. Monitoring basically take the human out of the boring part of the work to make sure that no mistakes are made. So in our example, what does the operators manage or a little bit about aerospike aerospike is a NoSQL database and it implements a hybrid memory architecture where the index is purely in memory, so not persisted, and the data is stored only on a persistent storage and reads directly from the disk. The disk I O is not required to access the index, which enables predictable performance. There are however strict SLAS petabytes of data handled in sub milliseconds and there is transactional guarantees. Basically means that the database transactions provide asset guarantees if needed. You can also have strong consistency, which is a term you probably heard from the cap theorem for distributed databases. All those reasons are why clients that are big financial institutes, for example banks and other clients from other industries, are using aerospike. Some other features that are a little more technical but relevant for the rest of this presentation would be rack awareness, which is a feature that allows you to store different replicas of records on different hardware failure groups. For these resilience and a multicluster XDR or cross data center replication setup. Multisite is when the nodes that are comprising a single cluster are distributed across different steps, a physical rack in a data center, an entire data center, or an availability zone in a cloud region. Basically, these cluster is stretched across regions and cloud providers, and it expands horizontally. This uses synchronous replication to deliver a global distributed transaction capability. The update speed is only limited by things like the speed of light. You can also go asynchronous for that. You'll use the cross data center replication setup, which uses asynchronous replication to connect to clusters that are located at different geographically distributed sites. It can extend the data infrastructure to any number of clusters easily. So we talked a little bit about the database. Let's talk about the operator. The Kubernetes operator is driven by a single custom resource CR, and it conforms with operator custom resource definition CRD. The cluster specs include things like the size, so the number of nodes per cluster and the resource allocation request. For example, the cpu per node it has the complete aerospike configurations and for example the YAML version of the Aerospike server, converting YAML based configuration to the aerospike version of them. And it handles the security configuration, TLS user management and so on. Because of some of the special features that we covered, it has some special considerations in what it does and how. So the deploying of the database clusters is a pretty obvious feature. Manages all the things lifecycle management, which means database cluster scale up and down, server version upgrade and downgrade, aerospike configuration management, rack awareness management, and cluster access control management. Also it handles all the fine details of the multi cluster cross data center replication setup. So remember how we said it spreads across the different availability zones, different cloud providers, and even combination of cloud and bare metal? That's a lot of configuration to figure out and keep up. And it monitors everything. So here are some of the engineering challenges we faced when we were developing it, with the first one being the persistent data. Each pod has a dedicated storage, and as we said, it must be persistent. We are using a database here. The logic is if it's new storage, it means that it probably has old data, because just like with computer memory, you cannot assume whether the storage that you were allocated with is empty or just filled with crash. But if you're restarting a pod, it means you probably have their relevant data. So you definitely don't want to touch that when you change the configuration. This is when the pod restarts, or when maybe something went wrong. You do want to save the storage, you do want to reuse that data. So a restarted pod has no metrics and there's kind of no kubernetes way of using that and telling whether this is a restart or a new pod. It also does not help that you probably have a new image because you did something like a version update. So how do you do this? These answer is flags. Add a flag using the init containers. This is what you run before your containers run, and that's how you init these devices. This is where you do the wiping of the data. How not to wipe data twice? The operator makes a cr for each resource in which we create a single tone instance upon initialization. So basically when the wiping is happening, we're adding a flag in the file. And then next time a pod restarts, it checks the config file. It steps that these flags exist so it knows not to wipe the data. Smart. Next challenge is changes during a rolling update. Say something happened, and the solution is to update the server version on all the ten nodes update on, node one complete, node two complete. Node three complete, abort. Suddenly you realize this is not the right thing for you to do, and you want to abort the server version update. But if the command that you issued is update on all ten nodes, how are you going to stop that? The way that we implemented this is that after every operation it recues the reconciliation request. Basically the operator is asking the API after every node, now what? This way, after it completed updating node three to the new version, the next step that it will receive as a command in the response to the question now what would be rollback or update to the old version? Node one, these, node two these, nodes three. And to make things even more efficient, the operator requests a delay in the response. Let's say that it knows that the migration of this specific node, which it cannot abort, will take this amount of time. A few records, it tens to the API, please respond to me, but not right away, but in a few seconds, because it can be that in those few seconds until the migration will be over, you will receive yet another change. So just make sure that you save resources and tell me what is the most up to date thing that I should do. And while it sounds very trivial to you now go check out different operators. I think these is a cool idea. The third challenge is what happens when you reach a really large scale. And well, we know that cloud is great for prototyping, but it can get pricey at a very large scale. And we have customers with half a trillion objects to give you a scale. Imagine that $1 buys you three and a half objects. So this is a little bit of a funny thing to say, because an object is kind of a row in the database, and, well, $1 does not buy you three and a half rows. But let's just say in this case half a trillion is how much Jeff Bezos can afford. And remember, we have these slas for clients that require petabytes of traffic in sub milliseconds. Next, there is the issue that cloud hardware is not homogeneous. It gives you a promise of a minimum cpu, but it doesn't commit to that. It means that some of the machines are in the minimal setup, but others can have a higher setup. And aerospike, due to its architecture is disk heavy. Sharing the I O means you definitely get a slice, but it's hard to cap the size of it. This means that your machine might respond slowly on messages and distributed database send around a lot of messages to make sure that they're in sync, to make sure that they know where is the most up to date replica of the data is right now and also to endpoints like the heartbeat, then there is also networking. The network is not private, but you do get a slice of it. However, if you have noisy neighbors, they can drive your performance down. Also, let's not even start the conversation about the cascading effects of such any of those interruptions. It's something that it's really hard to predict and well, the spoiler is there, an operator alone will not solve these. So how do our clients solve that private cloud? When you get to a really, really large scale, get a whole host and split the resources internally and do budget to have comes overcapacity. When your client is for example, snap in their scale, you do want to read from the client, not from the master. Be aware and max your communication to a local one because latency matters a lot at the scale. Of course Kubernetes will work great in such a setup. Think about it, it started inside Google in their private cloud. And yes, of course the operator will work great in these setup as well. If you want to read the operator source code, of course it's open source, available in GitHub. And here's a recap of what we saw today. We talked about the Kubernetes operator, how it controls the custom resource, what does it mean and what is a custom resource? Then we discussed a little bit some of the challenges that we faced when we built our distributed database operator at Aerospike. And the recommendations that you should be taking home are keep the data upon pod restart because that is a database. Be able to revert a rolling update immediately because that's definitely important. Doesn't happen often, but in the one time you do want this to happen and at a very large scale you can have all sorts of new problems and you need more than one solution. And automate, automate automate thank you very much for attending the talk. Please tweet, please share and I am looking forward to all your feedback. Thank you.

Natalie Pistunovich

Lead Developer Advocate @ Aerospike

Natalie Pistunovich's LinkedIn account Natalie Pistunovich's twitter account

Awesome tech events for

Priority access to all content

Video hallway track

Community chat

Exclusive promotions and giveaways