Deep Dive into Kubernetes Operators: Learning the "Why", "How" & "What" of DevOps's Best Friend

Abstract

Kubernetes is hard, complex, and stateful apps make it even harder. Operators are here to save the day for DevOps. They package and deploy the app-domain knowledge into your cluster and let it do its magic to manage the life cycle of your complex apps. But, what are these operators? Why do we need them? Why now?

Summary

Kubernetes can be seen as a composition of two different categories of workloads. There is the control plane, which you can think of as the brain behind your Kubernetes cluster. And then you have the data plane, or the worker nodes, which is where your workloads get scheduled. Within the control plane, the controller manager hosts the control loops, and control loops are the foundation for building operators.
A control loop runs continuously, noticing state changes as they come in and taking a certain action. Its purpose is that whenever a state change happens, the control loop tries to reconcile the desired state with the state that is actually being observed in the cluster in real time.
Control loops offer a higher-level abstraction for Kubernetes users. You, as the cluster administrator or cluster user, apply the desired state. The control loop executes continuously, watching those resources as they are created, updated, or deleted.
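The compare-and-act step described above can be sketched in a few lines of Go. This is a minimal illustration, not Kubernetes code: the `State` type, the `Reconcile` function, and the resource names are all hypothetical, and a real controller would run this pass repeatedly against the API server rather than against in-memory maps.

```go
package main

import (
	"fmt"
	"sort"
)

// State maps a resource name to its spec (here just an image tag).
// Both the type and its contents are assumptions made for this sketch.
type State map[string]string

// Reconcile compares the desired state with the observed state and returns
// the actions needed to make them match. It does not perform the actions.
func Reconcile(desired, observed State) []string {
	var actions []string
	for name, spec := range desired {
		current, exists := observed[name]
		switch {
		case !exists:
			actions = append(actions, "create "+name)
		case current != spec:
			actions = append(actions, "update "+name)
		}
	}
	for name := range observed {
		if _, wanted := desired[name]; !wanted {
			actions = append(actions, "delete "+name)
		}
	}
	sort.Strings(actions) // map iteration order is random, so sort for stable output
	return actions
}

func main() {
	desired := State{"web": "nginx:1.25", "cache": "redis:7"}
	observed := State{"web": "nginx:1.24", "old-job": "busybox"}
	for _, action := range Reconcile(desired, observed) {
		fmt.Println(action)
	}
}
```

When the two states already match, `Reconcile` returns no actions, which is what makes it safe to run the same pass over and over.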
Kubernetes is distributed by design. The three fundamental components of any controller are the informer, the work queue, and the events. The sequence diagram in the talk shows the extent of coordination that happens within a Kubernetes cluster when a state change is detected.
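To make the three components a little more concrete, here is a small sketch in plain Go. It is an assumption-laden model, not the client-go implementation: `Event`, `WorkQueue`, `Run`, and `ErrTransient` are hypothetical names, the queue is single-threaded, and the informer is reduced to a loop that turns events into keys.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTransient is a hypothetical error used to simulate a failed reconcile.
var ErrTransient = errors.New("transient failure")

// Event is a simplified stand-in for a state change delivered to a controller.
type Event struct {
	Type string // "ADDED", "MODIFIED" or "DELETED"
	Key  string // namespace/name of the affected object
}

// WorkQueue is a FIFO of object keys that never holds the same key twice.
type WorkQueue struct {
	keys    []string
	queued  map[string]bool
	retries map[string]int
}

func NewWorkQueue() *WorkQueue {
	return &WorkQueue{queued: map[string]bool{}, retries: map[string]int{}}
}

// Add enqueues a key unless it is already waiting in the queue.
func (q *WorkQueue) Add(key string) {
	if q.queued[key] {
		return
	}
	q.queued[key] = true
	q.keys = append(q.keys, key)
}

// Get removes and returns the oldest key, or false if the queue is empty.
func (q *WorkQueue) Get() (string, bool) {
	if len(q.keys) == 0 {
		return "", false
	}
	key := q.keys[0]
	q.keys = q.keys[1:]
	delete(q.queued, key)
	return key, true
}

// Run drains the queue, re-adding a key when reconcile fails, up to maxRetries.
func Run(q *WorkQueue, reconcile func(key string) error, maxRetries int) (done, dropped []string) {
	for {
		key, ok := q.Get()
		if !ok {
			return done, dropped
		}
		if err := reconcile(key); err != nil {
			if q.retries[key] < maxRetries {
				q.retries[key]++
				q.Add(key) // retry later
			} else {
				dropped = append(dropped, key)
			}
			continue
		}
		delete(q.retries, key)
		done = append(done, key)
	}
}

func main() {
	q := NewWorkQueue()
	// The informer's role in this sketch: turn events into keys on the queue.
	events := []Event{{"ADDED", "default/web"}, {"MODIFIED", "default/web"}, {"ADDED", "default/db"}}
	for _, ev := range events {
		q.Add(ev.Key)
	}
	attempts := map[string]int{}
	reconcile := func(key string) error {
		attempts[key]++
		if key == "default/db" && attempts[key] == 1 {
			return ErrTransient // fail once, succeed on retry
		}
		return nil
	}
	done, dropped := Run(q, reconcile, 3)
	fmt.Println("reconciled:", done, "dropped:", dropped)
}
```

Queuing keys rather than full objects is a deliberate choice in this sketch: two events for the same object collapse into one unit of work, and the reconcile function always acts on whatever the latest state is.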
Most CNCF-hosted projects are written in Go. The talk shows a snippet of Go code from the ReplicaSet controller. The idea is not to go deep into what this code is trying to do, but I would still encourage all of you to become a contributor to one of these projects.
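The snippet itself is not reproduced on this page, so the following is only a simplified sketch of the idea the talk describes for the ReplicaSet controller: compare the desired replica count with the observed pod count and create or delete the difference. The function name `ManageReplicas` and its signature are assumptions for illustration, not the upstream source.

```go
package main

import "fmt"

// ManageReplicas returns how many pods to create and how many to delete so
// that the observed pod count matches the desired replica count.
func ManageReplicas(desired, observed int) (create, remove int) {
	diff := observed - desired
	switch {
	case diff < 0:
		return -diff, 0 // fewer pods than requested: create the shortfall
	case diff > 0:
		return 0, diff // more pods than requested: delete the surplus
	default:
		return 0, 0 // already converged
	}
}

func main() {
	for _, observed := range []int{0, 2, 3, 5} {
		create, remove := ManageReplicas(3, observed)
		fmt.Printf("desired=3 observed=%d -> create %d, delete %d\n", observed, create, remove)
	}
}
```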
When it comes to resource state reconciliation, there is another very important concept in Kubernetes to understand: optimistic concurrency. It is a way of dealing with concurrent writes that might affect the same resource without locking, because maintaining the integrity of the state is just as important as performance.
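Optimistic concurrency can be modeled with a tiny in-memory store. This is a sketch under stated assumptions: `Store`, `Resource`, and `UpdateWithRetry` are hypothetical, the store is not safe for concurrent goroutines, and the resource version is an integer here for readability, whereas Kubernetes treats it as an opaque string. In real controllers, client-go offers `retry.RetryOnConflict` for the same retry pattern.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrConflict = errors.New("conflict: resource version is stale")
	ErrNotFound = errors.New("resource not found")
)

// Resource carries the version at which it was read.
type Resource struct {
	Name            string
	Replicas        int
	ResourceVersion int
}

// Store is a hypothetical stand-in for the API server plus its backing store.
type Store struct {
	objects map[string]Resource
}

func NewStore() *Store { return &Store{objects: map[string]Resource{}} }

func (s *Store) Create(r Resource) {
	r.ResourceVersion = 1
	s.objects[r.Name] = r
}

func (s *Store) Get(name string) Resource { return s.objects[name] }

// Update persists r only if it was read at the current version.
func (s *Store) Update(r Resource) error {
	current, ok := s.objects[r.Name]
	if !ok {
		return ErrNotFound
	}
	if r.ResourceVersion != current.ResourceVersion {
		return ErrConflict // someone else wrote first; the caller holds stale data
	}
	r.ResourceVersion++
	s.objects[r.Name] = r
	return nil
}

// UpdateWithRetry re-reads the resource and reapplies mutate on every conflict.
func UpdateWithRetry(s *Store, name string, maxAttempts int, mutate func(*Resource)) error {
	for i := 0; i < maxAttempts; i++ {
		r := s.Get(name) // always start from the latest version
		mutate(&r)
		err := s.Update(r)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrConflict) {
			return err
		}
	}
	return ErrConflict
}

func main() {
	s := NewStore()
	s.Create(Resource{Name: "web", Replicas: 3})

	stale := s.Get("web") // client A reads version 1
	fresh := s.Get("web") // client B reads version 1
	fresh.Replicas = 5
	fmt.Println("client B:", s.Update(fresh)) // accepted, version moves on

	stale.Replicas = 4
	fmt.Println("client A:", s.Update(stale)) // rejected, still carries the old version

	err := UpdateWithRetry(s, "web", 3, func(r *Resource) { r.Replicas = 4 })
	fmt.Println("client A retry:", err, s.Get("web"))
}
```

No lock is ever held between the read and the write; the cost of a collision is simply one more read-modify-write round trip for the client that lost.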
Operators are basically control loops, or you can think of them as custom control loops and more. The "more" is the operational intelligence that an operator has about your workloads. Operators were first introduced in 2016. Today, in the cloud native ecosystem, you want to automate as much as you can.
Custom resources offer you a mechanism to extend the Kubernetes API. They allow you to define your own custom kinds of resources. The idea is to help you define the abstractions that, at the end of the day, you want to offer to your own end users.
The problem is that, out of the box, nothing in the cluster is aware of any action to take when this type of custom resource is created. That's where we write custom controllers. And when we glue the custom resource and the custom controller together, this is exactly what we get through operators.
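As a rough sketch of how the two pieces fit together, the Go program below models a hypothetical `Database` custom kind and the part of a custom controller that expands it into lower-level resources. Everything here is illustrative: the `example.com/v1alpha1` group, the field names, and the list of child resources are assumptions, and `Validate` only stands in for the schema check that a CRD would drive in the API server.

```go
package main

import "fmt"

// DatabaseSpec is the desired state a user declares in the custom resource.
type DatabaseSpec struct {
	Engine    string
	Replicas  int
	StorageGB int
}

// Database is a hypothetical custom kind used only for this sketch.
type Database struct {
	APIVersion string
	Kind       string
	Name       string
	Spec       DatabaseSpec
}

// Child describes a lower-level resource the controller wants to exist.
type Child struct {
	Kind string
	Name string
}

// Validate plays the role of the schema check applied to a custom resource.
func (d Database) Validate() error {
	if d.Kind != "Database" {
		return fmt.Errorf("unexpected kind %q", d.Kind)
	}
	if d.Spec.Replicas < 1 {
		return fmt.Errorf("spec.replicas must be at least 1, got %d", d.Spec.Replicas)
	}
	if d.Spec.StorageGB < 1 {
		return fmt.Errorf("spec.storageGB must be at least 1, got %d", d.Spec.StorageGB)
	}
	return nil
}

// DesiredChildren expands one Database into the resources it is made of.
func DesiredChildren(d Database) []Child {
	children := []Child{
		{Kind: "StatefulSet", Name: d.Name},
		{Kind: "Service", Name: d.Name},
		{Kind: "Secret", Name: d.Name + "-credentials"},
	}
	// One volume claim per replica, named after the owning database.
	for i := 0; i < d.Spec.Replicas; i++ {
		children = append(children, Child{
			Kind: "PersistentVolumeClaim",
			Name: fmt.Sprintf("data-%s-%d", d.Name, i),
		})
	}
	return children
}

func main() {
	db := Database{
		APIVersion: "example.com/v1alpha1",
		Kind:       "Database",
		Name:       "orders",
		Spec:       DatabaseSpec{Engine: "postgres", Replicas: 2, StorageGB: 20},
	}
	if err := db.Validate(); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	for _, c := range DesiredChildren(db) {
		fmt.Printf("ensure %s/%s\n", c.Kind, c.Name)
	}
}
```

The end user only ever writes the `Database` object; the controller owns the translation into stateful sets, services, secrets, and storage.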
There is also the Kubebuilder framework. The talk mainly walks through the Operator SDK, which is part of the Operator Framework, but awareness of Kubebuilder is important because the Operator Framework is built upon it. There's an excellent online book about Kubebuilder, which has a lot of examples and tutorials.
The Operator SDK is part of the Operator Framework, which was developed by CoreOS and Red Hat together. It uses a bunch of libraries such as controller-runtime, apimachinery, and many others to make operator development easier. You don't have to bog yourself down if you do not know Go.
This talk revolves around Kubebuilder and the Operator SDK, or the Operator Framework. A lot of scaffolding happens behind the scenes, and each command has its own significance. However, this is only for local testing and development; eventually your operator would be deployed to your cluster as a Kubernetes Deployment.
For the Kubernetes cluster I'm using a local kind cluster, though it's not a hard requirement. You are free to experiment with any other distribution of Kubernetes. The idea is to provide you a mechanism to apply the knowledge that you have in a native way.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hello everyone, and welcome to this deep-dive talk about Kubernetes operators. Well, let's begin. Before we go forward, I think it's always a good idea to give ourselves a quick primer on Kubernetes architecture, right? How Kubernetes is constructed, how it is laid out. So we all know that Kubernetes can be seen as a composition of two different categories of workloads, right? There is the control plane, which you can think of as the brain behind your Kubernetes cluster. That's where all the intelligence is built. And then you have the worker nodes, also called the data plane. This is where your custom applications, your custom workloads, are designated to run. Now, control plane is actually an overarching term for a collection of different services which are native to Kubernetes, including your API server. That is kind of the gateway through which you interact with a cluster. Then there is the controller manager. This specific component will be our area of focus as we move forward, because this is exactly where the control loops reside, and control loops are indeed the foundation for building operators. Then there is the Kubernetes scheduler, which is responsible for scheduling your pods, your workloads, finding the appropriate node or nodes for your workloads to execute. Then there is the persistent store, the key-value store used by Kubernetes, where the state of your resources is actually saved. It is called etcd, however you happen to pronounce it. Then there is the cloud controller manager, which is sort of another controller manager consisting of control loops, but it deals with all things external to the cluster and can be very specific to certain cloud providers. So we are not going to go too deep into what a cloud controller manager is and what its roles and responsibilities are. 
However, just for awareness, know that there are in fact two controller managers in the control plane: the controller manager responsible for managing and operating upon the native Kubernetes resources within the cluster, and then the CCM, or cloud controller manager. And then of course you have the data plane, or the worker nodes, which is where your workloads get scheduled and execute. In the data plane, every node will also be running a couple of pretty specialized workloads. One is your kubelet, which is sort of a node agent in Kubernetes, and then there is kube-proxy, which basically uses iptables for various purposes like packet filtering, source NAT, destination NAT, and to an extent load balancing the traffic when you create your services with a set of pods as the back end. So it serves a bunch of different purposes, mostly networking related. As we move on, let's try to build some foundation first, right? So we touched upon the controller, or the controller manager. The controller manager is responsible for executing a bunch of control loops, but what are these control loops, what specific action are they performing, and on what? So control loops are implemented by, well, pretty obviously, controllers, and their responsibility is really to watch over the state of the specific Kubernetes API object that they are assigned to. And then, based upon the state changes that they are watching, they perform a specific action: either make a change, or request a change and delegate to someone else, most likely another controller, to act upon it. But the idea is very simple, right? It's a continuous loop which is noticing these state changes as they come in and taking a certain action. But what's the purpose of it? 
Right, the purpose is that as and when a state change happens, a control loop will actually try to reconcile it with the state that is being observed in real time in that very cluster. So that is the pretty foundational, fundamental aspect of the mechanics on which a control loop operates: watch for the state changes, compare them with the actual state, and then take the appropriate action to reconcile. That means matching the desired state, which is something that you apply as a cluster administrator or as a developer deploying your workloads on that very cluster, with the state that the Kubernetes control loop happens to observe in the cluster. It could be something as simple as scaling out the number of replicas for your web server, or it could be something pretty complicated like provisioning persistent storage for your database workloads and setting up replication, backups, snapshots, point-in-time recovery, et cetera. All right, moving on. Now, this is actually a pretty oversimplified view of what a control loop might look like in Kubernetes, right? But I wanted to keep things simple and not overwhelm everyone with all the complexities. Otherwise this diagram could get really messy, trust me. But yeah, the idea is that we are in the very initial phase of understanding what controllers and control loops are, so let's start with something which is clean and simple. So as we can see here, it's a pretty simple flow. You, as the cluster administrator or cluster user, apply the desired state. The control loop executes continuously, watching those resources as they are created, updated, deleted, or modified, and takes a specific set of actions. So let's say you deployed a web server with three replicas. The control loop will say, hey, I see a new deployment coming in. 
Let me go and start spinning up three identical pods which meet the criteria that you just defined through your manifest, through your YAML. And that is the whole value proposition of a control loop, if you ask me. Because look, as an end user, you are not really dealing with the low-level intricacies of Kubernetes. At the end of the day, when you supply a Deployment or a StatefulSet, everything is going to translate into a set of pods, and pods into containers, and whatnot. But this is the beauty of control loops, this is the beauty of controllers: they let you work with higher-level abstractions, which are also called controller objects. So if you look at Deployment, StatefulSet, Job, CronJob, all of these are basically controller objects which offer you a higher-level abstraction so that you don't have to deal with the low-level mechanics of Kubernetes, like which resources should be created when you create a deployment, right? And the additional benefit is that not only are you being abstracted away and offered something simpler in order to provision the workloads, but there is also this continuous reconciliation happening behind the scenes to make sure that the state of your application remains as you defined it initially, so that there are no surprises. Say you go to bed at night and one of the pods, one of the replicas of your deployment, crashes. The control loop observes that: hey, I see there's a delta, because the user wanted three replicas of this web server and I see only two. So let me go and request another replica. Right? That's the whole value add that control loops bring in. I have linked one excellent resource here. There's a book, Programming Kubernetes, from O'Reilly. I think it's an excellent book. If you have some Go programming background, go check it out. 
There's a lot of good information about the client APIs and the internal mechanics of Kubernetes, how control loops really operate. It's a very good resource if you are into Kubernetes-native development. All right, so we touched upon the control loop. But what are the building blocks of a control loop? I mean, if you were to create a controller for your workloads, how would you go about creating it? So there are several different frameworks, utilities, and SDKs available today for you to start writing your own operators or custom controllers. But fundamentally, regardless of which SDK or framework you choose, there are going to be three fundamental building blocks, three fundamental components of any controller, which are the informer, the work queue, and the events. Okay? And they each have a pretty specific set of responsibilities within your controller. So the informer, as an example, watches the state changes as they propagate in the cluster. It implements the resync and reconciliation mechanism that we just touched upon. Then there are work queues, which are basically responsible for queuing up these changes and implementing retries if something fails upon the first attempt to reconcile. And then there are events, which are basically the native representation of these state changes. Think about the CRUD operations that you perform on your resources: creating a new stateful set, creating a new deployment, updating the number of replicas, or a replica crashing for whatever reason, with your controller now responsible for creating another one. So all of these state changes, all of these modifications which take place, are represented as events to your controller. Well, I apologize for the not-so-clear sequence diagram, because indeed there are a lot of pillars and boxes here. That's why I have linked the resource from which I drew a lot of inspiration. 
There's an excellent blog from Andrew Chen and Dominik Tornow called The Mechanics of Kubernetes, which is very thorough, very detailed, and it was written four or five years ago, so it has been out there for some time. Please go and check out how they have laid out how Kubernetes control loops really work behind the scenes. I think it's a fantastic resource if you want to dive deep into controllers and control loops. But on a high level, what is represented here through an example is basically the extent of coordination that happens within a Kubernetes cluster when a state change is detected. So one thing to note is that Kubernetes is distributed by design. I mean, right now we talk about monoliths and microservices, which is better and which is worse. Kubernetes was fundamentally designed and architected with a distributed system in mind. So we saw the architecture, right? And how the control plane is composed of different components. Why is all the business logic not built into a single binary? For the sake of scale, of course. Distributed architectures are capable of scaling better. There could be challenges, especially if the components require a lot of coordination between them. It can get complicated, it can get complex. But Kubernetes is built with scalability, reliability, and efficiency in mind, right? And with microservices, with distributed architectures in general, you get that. So again, going back to this little diagram here on the left, it is just a representation of the different control loops getting involved when you actually request a deployment to be created, right? So let's say I want to deploy NGINX as a deployment in a Kubernetes cluster and I want three replicas of it. As a user, I can happily apply my manifest and, well, go and fetch some coffee. But let's look at the work that Kubernetes is performing through its own control loops. 
This is what happens when you perform an action such as creating a deployment in the cluster. So of course the API server will take in your request, right? It creates the deployment resource and persists it. What happens afterwards is really interesting, because after the API server you see a bunch of different controllers. So there's a deployment controller which is watching over the Kubernetes API resources of type Deployment. Deployment is a kind in Kubernetes. Now it sees that, hey, a new deployment has just been created. What am I going to do now? I'm going to request the API server to create a replica set. So it sends out a request to the API server saying, hey, corresponding to this deployment, can you create a replica set? The API server says yes, why not? It creates another resource. You can think of that as a child resource of the deployment. But the resource created now is a replica set. Then there's a replica set controller which is watching over the creation of new replica sets. It observes that there's a new replica set that has just been created, and it says it needs three replicas, three identical replicas of a certain type of container, the container here being your NGINX web server. But how many do I have right now? None. So let me ask the Kubernetes API server to go ahead and create three pods which will represent three replicas of this deployment, or this replica set. Again, the Kubernetes API server creates the pods. After that, what happens is that the scheduler, which is another component of the control plane that we saw in the architecture overview, looks at these pods and figures out that, hey, these are new pods and they need a node in order to execute. So let me go and find the appropriate nodes for these pods. It finds them, and your pods get scheduled to run on a certain node. Then the kubelet, which is also a kind of control loop, is looking for the pods which are designated to run on the very node where this kubelet is. 
You can have hundreds or thousands of nodes in a Kubernetes cluster, right? And each node is going to run a kubelet. So think of these individual kubelets on these nodes as control loops of their own, which wake up periodically to check whether there are any pods scheduled to run on the node on which they are executing. And if they find any, then their next job is to go and, well, launch the containers. And that's how your workload comes to life. So as you can see, there's a lot of coordination happening between different components. But the idea here is how beautifully, through the control loops, Kubernetes provides you this automation and a pretty good degree of abstraction: you as an end user are only requesting one single API resource, which is a deployment. And behind the scenes come all the complexities, whether it is creating replica sets, then the pods, and then finding the right node to run these pods, which depends upon the resources available on that node versus what you are requesting in terms of CPU and memory. Kubernetes does not want you to worry about all of this. It provides that automation, that intelligence to deal with the lower-level mechanics, which is a huge plus. And that's one of the biggest value propositions of Kubernetes, right? Why is it so popular? Because yes, workload orchestration, workload management, all that is good. But look at the degree of automation it provides: load balancing, scaling out, auto healing, auto repairing, restarting the crashed containers. This is all fantastic, right? Now, if you move on, again, this is a pretty difficult-to-read screenshot, but the idea is not to go deep into what this code is trying to do and in which language or runtime it is written. So it's Go code, it's a snippet of Go code. Kubernetes is written in Go. 
And if you do not know Go or have not programmed enough in Go, that's perfectly fine. But I would still encourage all of you to at least look at it, if you really want to take your Kubernetes expertise to the next level and maybe in future become a contributor to one of these projects hosted under CNCF. The majority of CNCF-hosted projects are actually written in Go. So is Kubernetes. By the way, this is the source code for the replica set controller which we just discussed on the previous slide, right? This is the real code hosted on GitHub. You can go and check it out if you want to after the talk. And what this code is trying to do is something pretty intuitive, pretty simple. It's just looking at the desired number of replicas. If the replicas requested are more than the replicas observed in the cluster, it creates new ones. And if the replicas requested are less than what is observed in the cluster, it goes ahead and terminates some pods, right? But at the end of the day, the idea is very simple. Whatever is the persisted state in the persistent store of your Kubernetes cluster, which is etcd, is taken as the source of truth, because that is what the declared state is. And all it is trying to do is match that declared state, which is the desired state, to what is being observed in real time in the cluster, right? So that's pretty much how all the control loops are going to work, depending upon what type of resources they are dealing with. That's why I thought it would be useful to put this little code snippet here, without worrying about whether you know Go or not, or whether your preferred programming language is not Go. 
That's perfectly fine, because the idea is, again, not to overwhelm anyone with the mechanics of Go and how the Go language works, but just to show you how a controller executes its assigned set of responsibilities. When it comes to resource state reconciliation, there is another very important concept in Kubernetes to understand. So we discussed Kubernetes being distributed. It's a collection of different components, which you can broadly think of as different microservices working together in a cohesive fashion. The concept here is optimistic concurrency. Now, why is optimistic concurrency important? I mean, it's very obvious that when you work with distributed systems, and you have state and the state is shared, the integrity of the state becomes very important. Now, Kubernetes is meant for running workloads at scale. That is why it consciously does not employ anything like resource locking or synchronization, because that could be a hindrance to both performance and scalability, and might actually result in higher resource usage than necessary. But at the same time, maintaining the integrity of the state is equally important. That is why Kubernetes employs something called optimistic concurrency. Optimistic concurrency, to put it simply, is the Kubernetes way of dealing with concurrent writes that might be affecting the same resource. So what Kubernetes does behind the scenes is maintain a resource version with every resource, right? And as and when this resource undergoes changes, the resource version keeps changing. So as a client, you fetch the resource, make an update, and go to persist that resource through the API server. The Kubernetes API server is going to check if the resource version has changed since then, right? And if it has, then that request is going to be rejected, because you might be operating upon stale data, right? 
And at this point, typically, and this is what even the control loops end up doing, as we will see, your client, let's say it's a control loop in this case, becomes responsible for handling these errors related to concurrent writes. It simply requeues the resource to be retried later, so that you can be sure that when you retry, you fetch it again. And hopefully this time you get the latest version and there are no concurrent writes going on. And then you make your modifications and you apply them. So that is the principle of optimistic concurrency: don't do any explicit locking, don't do any synchronization, rather rely on the resource version. And if a specific client happens to run into a problem where its request to write or change the state of the resource is rejected: requeue, refresh, update, and then try again. All right, so let's move on a little bit. Let's talk about operators now. So we discussed control loops, we discussed reconciliation. Operators are basically control loops, or you can think of them as custom control loops and more. But what is that "more"? That "more" is the operational intelligence that an operator has about your workloads. Operators were first introduced in 2016. So they are not a very novel, very new concept. They have been in use. And one of the founding principles for creating operators, or the whole framework to help you create your own operators, was: hey, can we codify, can we translate all the operational knowledge that the support engineers, DevOps engineers, and site reliability engineers have developed over time by operating a specific type of workload in production? So think about a database, right? When you create a database, it's not just about starting to consume it as soon as you create it. There are a lot of operational exercises that you have to do to manage a production-scale, production-level database. 
Think about backups, think about snapshots, think about point-in-time recovery, think about the transaction logs, how you archive them, how you back them up. Think about high availability, think about replicas, right? So there is a lot of operational complexity. Now, back in the day, maybe you had individuals who did that. But today, in the cloud native ecosystem, running your workloads at scale, you want to automate as much as you can and reduce this toil. It was probably okay if you had two or three, but you can't do this for thousands and thousands of databases that you are running in production, right? So how do you automate? Especially if you were to run these databases in Kubernetes as stateful workloads, how do you bring about this operational excellence through automation, so that you just worry about creating your databases and leave the rest to Kubernetes? That is where operators add a lot of value. And this is exactly the purpose for which operators were created, right? As an end user, you work with these abstractions, maybe a simple manifest which defines what your database is and provides some fundamental information. And when it comes to the day-two, day-three type of operational exercises or operational activities, you leave them to Kubernetes. There is an excellent blog, a pretty old blog written in 2016 by a few folks from CoreOS, the company which actually created the Operator Framework, which is pretty widely used today to build custom operators and controllers. You can go and check out this blog. It lays out the ideology behind creating operators, why we need them, and what the whole value proposition of having them is. So any operator in Kubernetes has two fundamental building blocks, and we will discuss both of them. However, we already touched upon what controllers are. 
In this case, they are just going to be custom controllers, which you will write using the available utilities, SDKs, and libraries. But there is one more concept here, which is the Kubernetes API extensions, or the custom resources. So what are custom resources? Let's dive into that. Custom resources, well, are not a new concept. They have been out there since a pretty early version of Kubernetes, which is 1.7, right? Custom resources offer you a mechanism to extend the Kubernetes API, which is to help you define your own custom kinds. I mean, if you look at Deployment, StatefulSet, Pod, these are all kinds of objects, right? These are all predefined kinds in Kubernetes. Kubernetes understands them. When you deploy a deployment through a YAML file and you mention that the kind is Deployment and the API version is this, which is the API group and the version of the API, Kubernetes has an understanding of it. It knows what it is going to do. Similarly, say you have a very specific type of workload, and again, I will take a database as an example. Yes, you can deploy, let's say, a Postgres database in production as a stateful set and be done with it, right? But is that the only thing that your database as a system consists of? Probably not. Because there is of course the storage part, which is the persistent volumes and persistent volume claims. Additionally, you might want to create a couple of services fronting these pods so that your clients can connect to these databases. You may want to define some access patterns, some database users and their roles. You might want to define some secrets and passwords to be stored either natively or outside of the cluster, depending upon your architecture. So you can probably imagine by now that, hey, it's kind of getting complicated, right? Because a database in a Kubernetes cluster may not be just a single resource. It's basically a collection of different resources, right? 
So how do I make it more abstract, something more generic which represents a database to the end user who's deploying it, but at the same time does not overwhelm them? And also let Kubernetes figure out which lower-level mechanics it has to apply in order to honor this specific resource abstraction and get it functional and running? That's where custom resources come in. And they're pretty common. I mean, almost every Kubernetes cluster that you deal with today in production would probably have some custom resource or other, either deployed explicitly by you or by the provider that you use, whether it's AWS or Azure or GCP, because that's how these providers deploy a lot of managed components for added value: through custom resources and operators. And if you look at all these different CNCF-hosted projects, Istio, Flux, Argo, Linkerd, they heavily use custom resources, and the idea is the same. Take Istio as an example. So it's a service mesh, right? And you just want to define, hey, these are the different policies that should exist, these are the different authorization rules that should exist. At the end of the day, this has to translate into lower-level constructs, right? And that lower-level construct could be something as simple as an iptables rule dropping the packets when one IP attempts to talk to another IP. But as an end user, you don't want to deal with iptables directly, right? And probably there is no way for you to deal with them directly either. So how do you go about configuring them? You use custom resources, which will tell Kubernetes, in a way, that hey, this means some change in the networking stack, so let me go and perform it. As a user, you don't worry about it, right? 
So the idea of custom resources is to help you define those higher-level abstractions that, at the end of the day, you want to offer to your own end users, but which at the same time are understood by Kubernetes, which is responsible for taking the actions at a much lower level. But how do you go about building custom resources? You can't just create custom resources out of thin air, right? Every kind, every type that you deal with in Kubernetes has a schema. It has to follow a set of rules. I mean, a certain property can be of type integer versus string versus boolean versus a map or a list or a dictionary, right? So when you create a custom resource in Kubernetes, you actually start with something called a custom resource definition. A custom resource definition is essentially what provides you a well-defined schema for creating your custom resources. Sometimes we call it a CRD, sometimes a custom resource definition. I think you will hear the word CRD, or CRDs, a lot throughout your Kubernetes journey. But CRDs essentially are what provide the schema definition for your custom resources. So the sequence would be that you write a CRD first and then you apply it to your cluster. Then you write a custom resource based on the CRD and you apply that resource to the cluster. And when you do so, just like the API server is capable of checking the native objects for correctness and for adherence to the schema, your custom resource would also be checked against its definition, whether it meets the criteria or not. And if not, your request would be rejected and the resource would not be created. Again, here's a quick sneak peek. This is from the official Kubernetes documentation. You can go and take a look yourself, but the idea here is that you have a CRD on the left. You define your API schema and then you eventually start creating resources out of this CRD. 
As long as they adhere to and comply with what you have defined, you should be able to create these custom resources in your cluster.
All right, but here's the problem. In one of the earlier slides, we saw the sequence of events that happens when you create a deployment: a bunch of controllers getting invoked and acting upon it within their own capacity, because they're aware of a certain type of resource, whether it's a deployment, a replica set, or a pod. But what about this custom resource? Who's aware of it? Yes, it got created, it got persisted, you can query it. But what's really happening? Technically, nothing, because the control loops that Kubernetes provides are specific to a certain kind. In that case, the kind was a deployment, a replica set, or a pod. Now I have defined my own custom type, and there's absolutely nothing in that cluster which is aware of any action that can be taken when this type of custom resource is created.
That's where we write custom controllers. That's the missing piece of the puzzle, and when we glue them together, this is exactly what we get through operators: the custom resource that we define and create, and the custom controllers, which are the control loops that are now aware of this custom resource and implement the business logic, take a list of actions, and deal with the lower-level mechanics of Kubernetes. Whether your custom resource demanded the creation of a stateful set and persistent storage, a bunch of secrets and config maps, or something else doesn't matter; now you have a controller that's looking for it. And the very last bullet point here is the Operator SDK. That is a utility that we will take a look at to build out a custom resource and a custom controller.
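The gap described above, a persisted custom resource that nothing acts on until a controller for its kind exists, can be sketched in a few lines of Go. This is a toy model, not Kubernetes code: `dispatcher`, `register`, and `dispatch` are hypothetical names, and real controllers are driven by informers and work queues rather than a direct function call.

```go
package main

import "fmt"

// event describes a change to a resource of some kind (simplified).
type event struct {
	Kind string
	Name string
}

// handler is the business logic a controller runs for one event.
type handler func(e event) string

// dispatcher routes events to the controller registered for their kind.
type dispatcher struct {
	controllers map[string]handler
}

func newDispatcher() *dispatcher {
	return &dispatcher{controllers: map[string]handler{}}
}

// register makes the cluster "aware" of a kind by attaching a controller to it.
func (d *dispatcher) register(kind string, h handler) {
	d.controllers[kind] = h
}

// dispatch returns the action taken and whether any controller handled the event.
func (d *dispatcher) dispatch(e event) (string, bool) {
	h, ok := d.controllers[e.Kind]
	if !ok {
		return "", false // the resource is stored and queryable, but nothing acts on it
	}
	return h(e), true
}

func main() {
	d := newDispatcher()
	d.register("Deployment", func(e event) string { return "create ReplicaSet for " + e.Name })

	_, handled := d.dispatch(event{Kind: "Redis", Name: "cache"})
	fmt.Println("Redis handled before custom controller:", handled)

	d.register("Redis", func(e event) string { return "create Deployment for " + e.Name })
	action, handled := d.dispatch(event{Kind: "Redis", Name: "cache"})
	fmt.Println("Redis handled after custom controller:", handled, action)
}
```

Registering the `Redis` handler is the stand-in for deploying the custom controller; the CRD plus that controller is what the talk calls an operator.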
There are a couple of other frameworks too, but we will walk through the Operator SDK in this talk. It's one of the more widely used frameworks and pretty easy to use. We will quickly look at all the boilerplate that the Operator SDK provides and how it simplifies operator development in Kubernetes.
So there is the Kubebuilder framework. We briefly talked about the Operator SDK, which is part of the Operator Framework. It's not super important to know Kubebuilder in and out, because if you happen to work with the Operator Framework, you will mostly be dealing with the specific toolkit that the Operator SDK provides. But awareness of Kubebuilder is important, because the Operator Framework is actually built upon Kubebuilder. Kubebuilder existed before; then the Operator Framework came in and made it a little simpler and more intuitive to use, but the fundamental building blocks, things like informers, workers, clients, and reconcilers, were all defined by Kubebuilder. Of course, you can build an operator using Kubebuilder directly; many people do that. And if you want to explore more, there's a link here at the bottom: there's an excellent online book about Kubebuilder, which has a lot of examples, tutorials, and other material for you to look at. So please do refer to it.
And then there's the Operator SDK, which is part of the Operator Framework. It was developed by CoreOS and Red Hat together. And, like I was mentioning on the previous slide, it takes care of the boilerplate.
So the Operator SDK uses a bunch of libraries, like controller-runtime, apimachinery, and many others, to make operator development easier and to take care of some rudimentary work which you, as a developer, would probably prefer not to do: scaffolding, creating automated unit test cases, a lot of code generation, bootstrapping, and a way to run your operator locally while connecting to a remote Kubernetes cluster for testing purposes.
What is more interesting about the Operator Framework: we saw examples in the previous slides from the replica set controller and how it was written in Go. But as I was mentioning, you don't have to bog yourself down if you do not know Go, because the Operator SDK not only supports writing operators in Go; you can also write operators with Ansible and Helm. The link is there at the bottom. You can refer to the Operator SDK or Operator Framework documentation; there are a lot of details out there. The idea is that if your runtime of choice is not Go, and you are more comfortable writing your operator in something else like Ansible or Helm, you can do that. And by the way, if you have, let's say, a Python programming background, there is an operator framework called Kopf, which stands for Kubernetes Operator Pythonic Framework. You can check that out as well.
So, like I was saying, knowledge of Go is nice to have, especially if you want to dive deep into some of the engineering decisions that the Kubernetes team has made or will make in the future as the platform continues to evolve. But if you are trying to extend Kubernetes, create your own custom resources and operators, and write your own control loops, and you do not know Go, that's perfectly fine, because there are other options out there and people are using frameworks like Kopf.
This talk revolves around Kubebuilder and the Operator SDK, or essentially the Operator Framework. But like I said, it's not a hard limitation.
Now, I have given some pretty basic commands here for you to refer to. Given the time constraints and the length of this talk, I won't be able to do a live development of an operator, but these are still very handy commands, and there's a lot of information about what they do in the official documentation of the Operator Framework. What you should expect as you execute each command is something we will go through in a moment. I will share my screen with my IDE so we can look at all the scaffolding, all the boilerplate code generation, and what's really happening behind the scenes.
These are some of the key commands that you need to be aware of, starting with the initialization, where I give a domain option. That domain becomes a part of my API group; as you may know, API groups in Kubernetes are basically qualified subdomain names. The repo here is nothing but a name for my Go module, because at the end of the day my operator will be packaged and served as a Go module. So this github.com/acme/redis-operator does not have to exist somewhere on GitHub. It is just the Go module naming convention.
The second command creates the APIs and the Go types. This is very important to understand as well: the group here is cache. So when you create your manifest, just like you see apps/v1 when you create a deployment, or networking.k8s.io when you create a network policy, when you create a resource of type Redis, which is provided in the kind option of the second command, the API group will be cache plus the domain, which is acme.io, so cache.acme.io. And then comes the API version.
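As a small illustration of how the group and domain options combine into what appears in a manifest, here is a Go sketch. The helper names `apiGroup` and `apiVersion` are hypothetical; the values `cache`, `acme.io`, and `v1alpha1` are the ones used in the talk.

```go
package main

import "fmt"

// apiGroup joins the group and domain values into the API group name.
func apiGroup(group, domain string) string {
	return group + "." + domain
}

// apiVersion builds the apiVersion string that goes at the top of a manifest.
func apiVersion(group, domain, version string) string {
	return apiGroup(group, domain) + "/" + version
}

func main() {
	// prints cache.acme.io/v1alpha1
	fmt.Println(apiVersion("cache", "acme.io", "v1alpha1"))
}
```

A manifest for the Redis kind would then carry that string in its `apiVersion` field, in the same position where a deployment carries `apps/v1`.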
You may know how version semantics work within Kubernetes. We are starting with v1alpha1, which is of course not a production-ready API. As we mature it, we go to beta1, we go through the betas, and then we go to GA, which becomes v1. That's how the Kubernetes version semantics work. Again, it's not something we need to go into in detail.
And then of course there are some make commands here. The Operator SDK uses a Makefile with some specific targets, and each command has its own significance, from generating the types and kinds to generating manifests, which involves creating CRDs, some bases, some samples, and some cluster roles and role bindings for your operator so that it has the appropriate permissions to operate on a specific type of resource. The make install run command is a utility command included in the framework to help you run the operator locally. However, this is only for local testing and development; when you develop an operator for your production systems, your operator will eventually be deployed to your cluster as a Kubernetes deployment.
So let's do a quick walkthrough. I'm just going to unshare this for a moment and share my screen. Just give me a moment here. All right, so let's see a working example of an operator. On the previous slide we saw a bunch of different commands for the Operator SDK and what each command does. However, the idea is not to go into the details of what I get from each individual command that I execute, because you can find that level of detail in the official documentation of the Operator Framework. The idea here is to help you understand how these operators behave at runtime.
Now, before we begin, I just want to make you aware of the directory structure that I'm using here. The conf42 Redis operator directory is my project directory. Here you see a lot of subdirectories and a bunch of files. There's a lot of Go code. Let me explain a couple of things. The job of the Operator Framework, or for that matter Kubebuilder, is to simplify the operator development task, and they do so by taking care of a lot of boilerplate that you would otherwise have to write yourself. So there's a lot of scaffolding that happens behind the scenes as you run those commands. That includes the generation of a lot of these files: Go code, markers, annotations. Even your custom resource definition is created based on the values you provide for those options when you run the init and create api commands.
There is also, like I said, a lot of Go code that gets generated. For example, redis_types.go defines the Go struct representation for my custom kind, which in this case is Redis. Again, if you do not know Go programming or have not worked extensively in Go, that is perfectly fine. The idea here is not to make you a Go expert or to assume that you are one, but just to show you the value proposition of using a framework like the Operator Framework. All these Go files, trust me, I didn't write from scratch. A lot of skeleton code was created for me, and then I just decorated this code as per my needs and requirements and ran some of those make commands to create my CRD, generate the skeleton code for my controller, and things like that. So if you look right here under config/crd, since we just mentioned the CRD, I have my custom resource definition created. This is where I have the whole API schema for my custom resource, which is Redis in this case.
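For readers who have not seen a generated types file, the following self-contained Go sketch shows the general shape of a struct representation for a custom kind. It is deliberately simplified and uses only the standard library: the scaffolded file in a real project embeds the Kubernetes metadata types and carries code-generation markers, and the field `Size`, the object name `sample-cache`, and the helper `parseRedis` are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metadata is a cut-down stand-in for the standard Kubernetes object metadata.
type Metadata struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

// RedisSpec is the desired state requested by the user (Size is an assumed field).
type RedisSpec struct {
	Size int32 `json:"size"`
}

// Redis is the Go representation of one custom resource of kind Redis.
type Redis struct {
	APIVersion string    `json:"apiVersion"`
	Kind       string    `json:"kind"`
	Metadata   Metadata  `json:"metadata"`
	Spec       RedisSpec `json:"spec"`
}

// parseRedis decodes a JSON manifest into the typed struct.
func parseRedis(data []byte) (Redis, error) {
	var r Redis
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	manifest := []byte(`{
		"apiVersion": "cache.acme.io/v1alpha1",
		"kind": "Redis",
		"metadata": {"name": "sample-cache", "namespace": "conf42"},
		"spec": {"size": 3}
	}`)
	r, err := parseRedis(manifest)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s %s/%s wants %d replicas\n", r.Kind, r.Metadata.Namespace, r.Metadata.Name, r.Spec.Size)
}
```

The struct tags are what tie the Go fields to the field names a user writes in the manifest, which is also the information the generated CRD schema is derived from.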
And look at what it looks like: it has taken care of specifying all the different things here, the group, the kind, the API version, and whether my type is namespace-scoped or cluster-scoped. All these details are captured.
Under cmd, I have my main.go. If you're familiar with Go, package main and function main are always going to be the entry point for the program. Here I see that, with the help of the controller-runtime package which I'm importing, I'm able to instantiate a manager. If you recollect the architecture slide we reviewed, showing what the Kubernetes architecture looks like, there was this controller manager which was responsible for running and managing a bunch of different control loops. This is exactly what it is. This is my controller manager, and this is where I'll be bootstrapping my custom control loops.
Then if I go here, under internal/controller, I see redis_controller.go. All the business logic for handling resources of type Redis goes in here. So what am I going to do when a resource of type Redis is created? Of course I'm going to go and try to find it, because the event says the resource has been created. If I'm not able to find it, maybe this is a resource deletion. And if it is a resource deletion, then let's see if it has finalizers. If you're familiar with finalizers in Kubernetes, they are meant for doing some cleanup work before the resource is actually deleted from the cluster. In this case, I don't have any such complicated scenarios, but just for the sake of it, the methods exist: if you have a finalizer, go and honor it before proceeding with the resource deletion. Now this is where it gets interesting: line 159, where I'm requesting a deployment for my Redis resource.
Because, like I was saying, Redis as a kind, as a type, means nothing to Kubernetes. It's not a native Kubernetes object. I am the one who is giving it some meaning by means of this controller. But at the end of the day it has to roll out into lower-level Kubernetes constructs. That is, there has to be a deployment or a stateful set, there have to be pods, there have to be containers. And this is exactly what I'm trying to do here: try to find a deployment if it exists, and if not, go ahead and create one. There is a bit of reconciliation logic going on here as well, where I'm comparing the size that I specify when I create my Redis resource against the observed state, that is, how many replicas I have. If there is a difference, if there's a drift, go ahead and correct that drift.
So in a nutshell, my control loop, as we can see here, is managing the whole lifecycle of a resource of type Redis without revealing all these details to the end user. And that is what I was trying to stress: when you deal with custom resources, operators, and custom control loops, that's the whole value proposition. You can really simplify how your workloads are represented to your consumers. Kubernetes is hard and complicated, and it's probably not a fair assumption that everybody out there is super familiar with all the lower-level details and inner workings of Kubernetes. So how can you abstract it out? How can you make life simple for them without compromising on Kubernetes best practices and automation standards? Operators give you that control. And you can also get very opinionated, right?
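The reconciliation steps described above (look for the deployment, create it if it is missing, correct any drift in size) can be sketched as a small in-memory Go program. This is not the controller from the demo: the `cluster` type stands in for the API server, and `reconcile`, `redis`, and `deployment` are hypothetical names. A real controller built with controller-runtime would use a client to get, create, and update objects.

```go
package main

import "fmt"

// redis is the desired state declared by the user.
type redis struct {
	Name string
	Size int32
}

// deployment is the lower-level object the controller manages.
type deployment struct {
	Name     string
	Replicas int32
}

// cluster is an in-memory stand-in for the API server.
type cluster struct {
	deployments map[string]*deployment
}

// reconcile drives the observed state toward the desired state and
// reports which action, if any, was taken.
func reconcile(c *cluster, r redis) string {
	d, found := c.deployments[r.Name]
	if !found {
		c.deployments[r.Name] = &deployment{Name: r.Name, Replicas: r.Size}
		return "created"
	}
	if d.Replicas != r.Size {
		d.Replicas = r.Size // correct the drift
		return "scaled"
	}
	return "unchanged"
}

func main() {
	c := &cluster{deployments: map[string]*deployment{}}
	desired := redis{Name: "sample-cache", Size: 3}

	fmt.Println(reconcile(c, desired)) // first pass: nothing exists yet
	c.deployments["sample-cache"].Replicas = 1
	fmt.Println(reconcile(c, desired)) // drift introduced above is corrected
	fmt.Println(reconcile(c, desired)) // steady state
}
```

Because the function only compares desired and observed state, it can be called repeatedly with the same input and converges, which is the property a control loop relies on.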
Because you can expect only the abstractions to be created by your users, and then you can build your controls around what really happens when a user requests your custom resources. However you want to control it, you can specify some specific security measures: for example, the container should not run as root, the container file system should be read-only, or a specific image tag may not be used because it is not approved. So think about it: how far can you go with these operators and controllers? There is absolutely no limit.
The example here is a pretty simple one, just a stateless Redis cache. But think about more complicated workloads, like, say, a Postgres database in production, and think about all the operational aspects of a database: backup, recovery, snapshots, point-in-time recovery, log archival. Everything becomes important. But you don't necessarily want to assume that your end user is a database administrator. So you can give them a bit of abstraction through these custom resources, and then you translate them through your own operational expertise in that specific field, which is Postgres, or Kubernetes in general. How do you want the resource to be handled? What should happen when a backup is requested? What should happen when a point-in-time recovery is requested? Things like that.
So let's see it in action. Like I said, I have done a lot of the work already before the talk. For the Kubernetes cluster I'm using a local kind cluster, though it's not a hard requirement. You are free to experiment with any other distribution of Kubernetes, and if you have a cluster at your disposal from EKS, AKS, or GKE, please feel free to use that, as long as you're authenticated to the cluster and you have cluster admin rights. For local development and experimentation, I prefer using kind or minikube. They're pretty good.
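The opinionated controls mentioned earlier (no root, read-only file system, unapproved image tags) are the kind of check a controller can apply before it creates any lower-level object. The Go sketch below is one hypothetical way to express them; the `container` struct, `checkPolicy`, and the denied-tag list are illustrative assumptions, not part of any Kubernetes or Operator SDK API.

```go
package main

import (
	"fmt"
	"strings"
)

// container holds only the settings the policy cares about (simplified).
type container struct {
	Image                  string
	RunAsNonRoot           bool
	ReadOnlyRootFilesystem bool
}

// imageTag extracts the tag from an image reference, defaulting to "latest".
// A colon that appears before the last slash belongs to a registry port.
func imageTag(image string) string {
	if i := strings.LastIndex(image, ":"); i > strings.LastIndex(image, "/") {
		return image[i+1:]
	}
	return "latest"
}

// checkPolicy returns one message per violated rule; an empty result means compliant.
func checkPolicy(c container, deniedTags map[string]bool) []string {
	var violations []string
	if !c.RunAsNonRoot {
		violations = append(violations, "container must not run as root")
	}
	if !c.ReadOnlyRootFilesystem {
		violations = append(violations, "container file system must be read-only")
	}
	if tag := imageTag(c.Image); deniedTags[tag] {
		violations = append(violations, "image tag "+tag+" is not approved")
	}
	return violations
}

func main() {
	denied := map[string]bool{"latest": true}
	bad := container{Image: "redis:latest", RunAsNonRoot: false, ReadOnlyRootFilesystem: true}
	good := container{Image: "redis:7.2", RunAsNonRoot: true, ReadOnlyRootFilesystem: true}

	fmt.Println(checkPolicy(bad, denied))
	fmt.Println(len(checkPolicy(good, denied)) == 0)
}
```

Since the end user only ever submits the custom resource, rules like these live in one place, the controller, rather than being repeated in every manifest.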
For this demo I'm using a local kind cluster. I just want to make sure that my cluster is up and running and listening to API requests, which it is. Great. Now let's start from the beginning and start the control loop. I'll be using some specific make commands to bootstrap my controller locally. So it's not going to run within the cluster; it's going to run outside as a standalone process and use the kubeconfig file and its current context, which is an authenticated cluster admin user on my kind cluster, to make the API requests. So I'm going to run this command, make install run. This is going to bootstrap my controller manager, and with it my controller, locally. All right, it looks like my controller is up.
Now I go to this terminal. In this directory I have already created some sample manifest files. This is the file of interest. If you look at what I'm trying to do, it's just a very simple definition for my Redis type of resource. I'm requesting the cluster: hey, create a Redis cache with three replicas. So you can see that I'm requesting a resource of type Redis, but when my controller sees it, it's going to translate it into a deployment with three replicas. And that is exactly what we want to verify. So let's do kubectl apply -f with the sample cache YAML. Okay, it says it created it. Now, this resource was a namespace-scoped resource, so let's verify that there is a resource of type Redis in the conf42 namespace. And there indeed is. Now let's see if it also created a deployment in the namespace that corresponds to the Redis resource. And there is; you can see that there are three replicas, and all three are healthy, up and running. Now, if I go ahead and delete this Redis cache resource, what do you think should happen?
It should not only delete the custom resource. If you remember, we saw it in the controller code towards the bottom: when I am setting up the control loop with my controller manager, I'm specifying that it also owns the deployment that it creates. That is very important. And it is reflected through this line of code right here, where I'm associating the deployment with my Redis resource as I create it: set controller reference. If I don't set it, then this deployment will fall under the control of the control loop which is built into the controller manager inside the Kubernetes control plane. And we don't want that to happen. We want the lifecycle of this deployment to be managed by this custom controller. Okay, so let's delete this and then see if I'm still able to find a deployment. No, none was found. So this way we were able to tie together the lifecycle of the custom resource and everything that the custom resource rolled out into, in this case a Kubernetes deployment.
So that's, at a high level, how operators work. Of course you can play around more than that: try modifying the size, try deleting one of the replicas from the deployment, and see what happens behind the scenes. But the idea is that operators are there to provide you a mechanism to apply the operational knowledge that you have about a specific type of workload in a Kubernetes-native way. You could do it in maybe several other ways. We took the database example, and those who have been in DBA roles understand pretty well what it means to take periodic backups, full backups, partial backups, snapshots, and whatnot.
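The ownership link behind this behavior can be modeled in a few lines. In controller-runtime, `controllerutil.SetControllerReference` records the custom resource as the controlling owner in the child object's owner references, and Kubernetes garbage collection then removes the child when the owner is deleted. The Go sketch below imitates that cascade in memory; `store`, `setOwner`, and `delete` are hypothetical names, not the real API.

```go
package main

import "fmt"

// object is a minimal resource with an optional owner reference.
type object struct {
	Kind, Name           string
	OwnerKind, OwnerName string // empty when the object has no owner
}

// store is an in-memory stand-in for the cluster's object storage.
type store struct {
	objects []object
}

// setOwner records owner as the controller of child (illustrative helper).
func setOwner(owner object, child *object) {
	child.OwnerKind, child.OwnerName = owner.Kind, owner.Name
}

// has reports whether an object of the given kind and name exists.
func (s *store) has(kind, name string) bool {
	for _, o := range s.objects {
		if o.Kind == kind && o.Name == name {
			return true
		}
	}
	return false
}

// delete removes the object and, like garbage collection, everything it owns.
func (s *store) delete(kind, name string) {
	var kept, dependents []object
	for _, o := range s.objects {
		if o.Kind == kind && o.Name == name {
			continue // drop the object itself
		}
		if o.OwnerKind == kind && o.OwnerName == name {
			dependents = append(dependents, o)
		}
		kept = append(kept, o)
	}
	s.objects = kept
	for _, d := range dependents {
		s.delete(d.Kind, d.Name)
	}
}

func main() {
	cache := object{Kind: "Redis", Name: "sample-cache"}
	dep := object{Kind: "Deployment", Name: "sample-cache"}
	setOwner(cache, &dep)

	s := &store{objects: []object{cache, dep}}
	s.delete("Redis", "sample-cache")
	fmt.Println("deployment still present:", s.has("Deployment", "sample-cache"))
}
```

A deployment created without the owner link would survive the deletion of the Redis resource in this model, which is why the demo sets the reference at creation time.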
But at the end of the day, when you look at these systems running at large scale, and especially in Kubernetes, if Kubernetes is offering you a way to eliminate this operational toil and codify the operational excellence and knowledge that you have gathered over the years by operating these specific types of workloads, you're going to benefit. So that's all I had for this talk. Thank you so much for joining, and I hope you enjoyed it. Again, thank you so much, and have a good day.

Conf42 DevOps 2024 - Online

January 25 2024

Rohit Mishra

Senior Customer Engineer @ Google Cloud
